How Do Search Engines Work?

Millions of people use search engines such as Google and Yahoo on a daily basis. However, not many really understand how these engines work. Although search engines have different names and different search algorithms, they all work on the same principle.
If we look at how information is searched online, we can break it into the following stages:
- gathering information from site pages on the Internet
- indexing sites
- searching by request and ranking the results
Let’s look at each of these stages separately.
Gathering the information
As soon as you launch your site and let a search engine bot know about your new resource (through links to your site, by submitting it to an “add url” service, or in some other way), the bot will visit you and start going from page to page, gathering information (your text content, images, video, and so on). This process is called data gathering, or crawling, and it does not happen only when the site is first launched. The robot creates a schedule for your site that tells it when to visit next, re-check the old information, and add new pages, if there are any.
It’s important that the “conversation” between your site and the search engine robot is pleasant for both sides. It’s in your best interest that the bot doesn’t stay on your site for long, so that it does not put unnecessary load on your server, and at the same time it has to gather all the data from all the needed pages correctly. The robot wants to harvest the information as quickly as possible, so it can move on to the next website on its schedule (millions and millions of pages). To help this happen, you need to make sure that your site is accessible and that there are no problems with site navigation (JavaScript and Flash menus are not recognized well by the bots yet). You have to make sure there are no dead pages (404 not found error pages), and you should not make the bot go through pages that are only available to registered users, and so on. Also, remember that search bots have limits on how deep they can reach into your site and on how large the scanned text can be (usually it’s 256 kB).
You can control the search engine bot’s access by modifying the robots.txt file. Also, a sitemap.xml file can help the robot if, for some reason, the navigation is not easy to follow.
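To make the crawling process more concrete, here is a minimal sketch of a bot walking a site breadth-first. Everything in it is an assumption made for illustration: the in-memory `SITE` dictionary stands in for real HTTP requests, a missing key stands in for a 404 page, and `MAX_DEPTH` is an invented depth limit. Real search engine bots are far more complex.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "site": path -> HTML. A real bot would fetch over HTTP.
SITE = {
    "/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<p>About us</p> <a href="/">Home</a>',
    "/blog": '<a href="/blog/post-1">Post 1</a> <a href="/missing">Broken</a>',
    "/blog/post-1": '<p>First post</p> <a href="/blog/post-1/comments">Comments</a>',
    "/blog/post-1/comments": "<p>Comments</p>",
}

MAX_DEPTH = 2            # assumed limit on how deep the bot goes
MAX_CHARS = 256 * 1024   # rough stand-in for the 256 kB text limit


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start="/"):
    """Breadth-first crawl; returns (gathered pages, dead links)."""
    gathered, dead = {}, []
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        path, depth = queue.popleft()
        html = SITE.get(path)
        if html is None:              # stands in for a 404 response
            dead.append(path)
            continue
        gathered[path] = html[:MAX_CHARS]  # keep only the first part of large pages
        if depth == MAX_DEPTH:        # do not follow links any deeper
            continue
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return gathered, dead


pages, dead_links = crawl()
print(sorted(pages))   # pages the bot managed to gather
print(dead_links)      # links that led to missing pages
```

In this sketch the comments page sits three clicks from the home page, so the bot never reaches it, and the broken link is reported as a dead page. Both are situations the paragraph above advises you to avoid.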
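As an illustration, the sketch below checks a sample robots.txt with Python's standard `urllib.robotparser` module. The file contents, the `ExampleBot` name, and the example.com URLs are made up for this example. Your own rules will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep bots out of members-only and admin pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /members/
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a bot (the name is an assumption) may fetch a given URL.
blog_allowed = parser.can_fetch("ExampleBot", "https://www.example.com/blog/post-1")
members_allowed = parser.can_fetch("ExampleBot", "https://www.example.com/members/profile")
sitemaps = parser.site_maps()  # available in Python 3.8 and later

print(blog_allowed)
print(members_allowed)
print(sitemaps)
```

Well-behaved bots read this file before crawling, so disallowing the members-only area keeps them from wasting their visit on pages they cannot index anyway.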
Indexing
The robot can travel around your site for a long time, but that doesn’t mean the site will immediately appear in the search engine results. Site pages have to go through a stage called indexing, in which each page is added to the inverted index. The index is used to speed up the search, and it usually holds a list of the words from the text together with information about them (positions in the text, weight, and so on).
After indexing has happened, the site and its pages appear in the main search engine results, and they can be found via the keywords (or keyword phrases) present in your text. Indexing usually happens pretty quickly after the robot pulls the information from your site.
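A toy version of an inverted index can be sketched in a few lines. The `PAGES` data and the `tokenize` helper are assumptions made for this example, and the sketch stores only word positions, not weights. A real search engine index is much richer.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages: URL path -> extracted text.
PAGES = {
    "/": "Search engines crawl pages and index pages",
    "/about": "We write about search engine optimization",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each word to the pages containing it and its positions there."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            index[word].setdefault(url, []).append(position)
    return index


index = build_inverted_index(PAGES)
print(index["search"])  # {'/': [0], '/about': [3]}
print(index["pages"])   # {'/': [3, 6]}
```

The point of the structure is that a lookup starts from the word, not from the page: finding every page that contains "search" is a single dictionary access instead of a scan through all the texts.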
Information Search
During the search, the first step is the analysis of the request entered by the user (request processing), which calculates the weight of each keyword.
Then the search is run against the inverted index to find the documents in the search engine database that are the most likely matches for the given request. In other words, the similarity of each document to the request is calculated using this formula:
similarity(Q,D) = SUM(wqk*wdk),
where similarity(Q,D) is the similarity of the request Q to the document D, and the sum is taken over all keywords k:
wqk - the weight of the keyword k in the request
wdk - the weight of the keyword k in the document (formulating results)
The documents that have the closest match to the request end up in the search results.
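Here is a small worked example of the similarity formula. The keyword weights are invented for illustration. How a real search engine assigns weights is not covered here.

```python
def similarity(query_weights, doc_weights):
    """similarity(Q,D) = SUM(wqk*wdk) over the keywords k of the request."""
    return sum(
        wq * doc_weights.get(keyword, 0.0)  # a keyword missing from the document weighs 0
        for keyword, wq in query_weights.items()
    )


# Hypothetical weights for the request "search engine".
query = {"search": 1.0, "engine": 0.5}

# Hypothetical keyword weights for three indexed documents.
documents = {
    "/a": {"search": 0.5, "engine": 0.5, "crawl": 0.25},
    "/b": {"search": 0.25, "index": 0.75},
    "/c": {"recipes": 1.0},
}

scores = {url: similarity(query, weights) for url, weights in documents.items()}

# Keep only the documents that match at all, closest match first.
matches = [
    url
    for url, score in sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if score > 0
]

print(scores)   # {'/a': 0.75, '/b': 0.25, '/c': 0.0}
print(matches)  # ['/a', '/b']
```

Document /a scores 1.0*0.5 + 0.5*0.5 = 0.75, document /b scores 1.0*0.25 = 0.25, and document /c shares no keywords with the request, so it drops out of the results.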
Ranking
After the closest matching documents have been selected from the main collection, they have to be ranked in descending order, so that the top positions hold the results most useful to the user. To achieve this, each search engine uses its own ranking formula. The exact form differs between search engines, but the main ranking factors are the same:
- domain authority (for example, PageRank)
- page weight
- relevancy of the text to the request
- relevancy of external links
- many other ranking factors
A simplified ranking formula can be found in some optimizer forums:
Ra(x) = (m*Ta(x) + p*La(x)) * F(PRa)
where:
Ra(x) – resulting match of the document a to the request x,
Ta(x) – relevancy of the text (code) of the document a to the request x,
La(x) – relevancy of the links from other documents to the document a, to the request x,
PRa – page authority index of the document a,
F(PRa) – a monotonically non-decreasing function with F(0)=1; for example, we can take F(PRa) = (1+q*PRa),
m, p, q – some coefficients.
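To see how the pieces of this simplified formula interact, here is a short sketch that evaluates it for three documents. The coefficient values and all the relevancy and authority numbers are made up for illustration. No search engine publishes its real values.

```python
def authority_factor(pr, q):
    """F(PRa) = 1 + q*PRa, so F(0) = 1 and the factor grows with authority."""
    return 1 + q * pr


def rank_score(text_rel, link_rel, pr, m=1.0, p=0.5, q=0.25):
    """Ra(x) = (m*Ta(x) + p*La(x)) * F(PRa); m, p, q are assumed values."""
    return (m * text_rel + p * link_rel) * authority_factor(pr, q)


# Hypothetical candidates: (Ta(x), La(x), PRa) for one request x.
candidates = {
    "/a": (0.5, 0.5, 0),    # relevant text and links, no authority
    "/b": (0.5, 0.0, 4),    # relevant text, no relevant links, high authority
    "/c": (0.25, 0.25, 2),  # weaker on both, medium authority
}

ranking_scores = {url: rank_score(*factors) for url, factors in candidates.items()}
ranked = sorted(ranking_scores, key=ranking_scores.get, reverse=True)

print(ranking_scores)  # {'/a': 0.75, '/b': 1.0, '/c': 0.5625}
print(ranked)          # ['/b', '/a', '/c']
```

With these made-up numbers, document /b outranks /a even though /a has better link relevancy, because the authority factor multiplies the whole relevancy sum.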
This means that both internal and external factors are taken into account when documents are ranked. We can also divide the factors into those that depend on the request (relevancy of the document text or link text) and those that do not. Although this formula gives only a very general picture of how search engines rank their results, it answers many of the questions that people starting out in SEO, or already working in the field, have about search engines, and it makes many things clear.
Before I completely forget – Happy Thanksgiving to You and Your Families!
Beck @ ProfitSEO.com

