
Lesson 3: Search Engines
Objective: Describe how a search engine creates and maintains its database of sites.

How Search Engines Create a Database Index

Search engines create database indexes to optimize the search process and speed up retrieval of results. The indexing process involves the following steps:
  1. Crawling: The search engine crawls the web and gathers information about the content of web pages, such as keywords, metadata, links, and other relevant data.
  2. Parsing: The search engine parses the content of the web pages and extracts the relevant information, such as text content, images, and other media.
  3. Tokenization: The search engine breaks down the extracted information into smaller units, such as words or phrases, and assigns each unit a unique identifier.
  4. Stemming: The search engine applies stemming algorithms to the tokens to normalize them and reduce them to their root form. This allows the search engine to match variations of the same word, such as "run," "running," and "ran."
  5. Stop word removal: The search engine removes common stop words, such as "the," "and," and "a," which do not add meaning to the search query.
  6. Indexing: The search engine creates an index of the tokens and their corresponding web pages. The index contains a list of the tokens, along with their frequency and location in the web pages.
  7. Ranking: The search engine applies a ranking algorithm to the indexed pages to determine the relevance and order of the results for a given search query.
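Steps 3 through 6 above can be illustrated with a minimal inverted-index builder. This is a sketch, not a production indexer: the stop-word list, the suffix-stripping "stemmer," and the sample pages are all simplified stand-ins for what real engines use.

```python
from collections import defaultdict

# Minimal stop-word list; real engines use much larger ones
STOP_WORDS = {"the", "and", "a", "of", "to", "in"}

def stem(token):
    # Naive suffix stripping, standing in for a real stemmer such as Porter's.
    # "ning" is checked before "ing" so "running" reduces to "run", not "runn".
    for suffix in ("ning", "ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def build_index(pages):
    """Tokenize, normalize, and index page text: token -> [(url, position)]."""
    index = defaultdict(list)
    for url, text in pages.items():
        for position, raw in enumerate(text.lower().split()):
            token = raw.strip(".,!?\"'")
            if not token or token in STOP_WORDS:
                continue  # stop-word removal (step 5)
            index[stem(token)].append((url, position))
    return index

pages = {
    "a.html": "The runner was running a fast race",
    "b.html": "Races are run on the track",
}
index = build_index(pages)
```

After indexing, "running" and "run" share the entry `index["run"]`, and "races" and "race" share `index["race"]`, so a query for any variant can find both pages.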

Once the index is created, the search engine can quickly retrieve and display the relevant results for a given search query, based on the indexed information. The index is typically stored in a database, which is optimized for fast retrieval and search performance.
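Retrieval from such an index can be sketched as a lookup plus a simple score. The hard-coded index and the frequency-sum scoring below are illustrative assumptions; real engines combine frequency with position data and many other signals.

```python
# Hypothetical inverted index: token -> {url: term frequency}
INDEX = {
    "search": {"a.html": 3, "b.html": 1},
    "engine": {"a.html": 2},
    "crawler": {"b.html": 4},
}

def retrieve(query):
    """Score each page by the summed term frequency of the query tokens."""
    scores = {}
    for token in query.lower().split():
        for url, freq in INDEX.get(token, {}).items():
            scores[url] = scores.get(url, 0) + freq
    # Highest-scoring pages first
    return sorted(scores, key=scores.get, reverse=True)

print(retrieve("search engine"))  # a.html (score 5) before b.html (score 1)
```

Because the index is precomputed, each query touches only the entries for its own tokens rather than scanning every document, which is why lookups stay fast as the database grows.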

Maintaining a Database of Sites

In the previous module, we discussed the central characteristic of search engines that makes them different from directories. Search engine data is compiled by computer programs called robots or spiders that search the Web (and some search services search other areas of the Internet, as well) for documents, index them, and then store the results in a database.
  1. Automated robot or spider programs read information day after day from websites.
  2. The information is stored and indexed in the search service's database.
  3. The user composes a search query from keywords and symbols and submits it.
  4. The search engine's software searches the service's database for matches.
  5. Matches, or "hits," are assembled into a list of search results.


Operations of Search Engine and Content Relevance

Hypothetically, the most relevant search engine would have a team of experts on every subject in the world: a staff large enough to read, study, and evaluate every document published on the web so they could return the most accurate results for each query submitted by users. The fastest search engine, on the other hand, would crawl a new URL the second it is published and introduce it into the general index immediately, so it could appear in query results only seconds after going live.

The challenge for Google and all other engines is to strike a balance between those two scenarios: to combine rapid crawling and indexing with a relevance algorithm that can be applied instantly to new content. In other words, they are trying to build scalable relevance. With very few exceptions, Google is uninterested in hand-removing specific content. Instead, its model is built around identifying characteristics in web content that indicate the content is especially relevant or irrelevant, so that content across the web with those same characteristics can be similarly promoted or demoted.

This course frequently discusses the benefits of content created with the user in mind. To some hardcore SEOs, Google's "think about the user" advice seems beside the point; they would much prefer a secret line of code or server technique that bypasses the work of creating engaging content.
  • Focus on creating relevant content: Strange as it may seem, Google's focus on relevant, user-focused content really is the key to its algorithm of scalable relevance. Google is constantly looking for ways to reward content that truly answers users' questions and to minimize or filter out content built for content's sake. While this course discusses techniques for making your content visible and accessible to engines, that means content constructed with users in mind: innovative, helpful, and designed to serve the query intent of human users.
The sequence of operations, step by step:

1. Automated robot or spider programs read information day after day from websites that are linked to the site they are reading from.
2. Information is stored and indexed in the search service's database.
3. Compose a search query from keywords and symbols that restrict or expand a search and submit the query to the search engine.
4. The search engine searches the service's database with its software for matches to your search query.
5. Matches or hits are then assembled into a list of search results.
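Step 1, the spider following links from page to page, can be sketched as a breadth-first traversal. The in-memory `SITE` dictionary below is an assumption standing in for real HTTP fetches, and the regular-expression link extraction is a deliberate simplification of real HTML parsing.

```python
import re

# In-memory "web": URL -> HTML, standing in for real network fetches
SITE = {
    "/index.html": '<a href="/about.html">About</a> <a href="/news.html">News</a>',
    "/about.html": '<a href="/index.html">Home</a>',
    "/news.html": "No links here",
}

def crawl(start):
    """Follow links breadth-first from a start page, visiting each page once."""
    visited, queue = set(), [start]
    while queue:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        html = SITE.get(url, "")
        # Extract href targets; a real spider would parse the HTML properly
        for link in re.findall(r'href="([^"]+)"', html):
            if link not in visited:
                queue.append(link)
    return visited
```

Starting from `/index.html`, the crawler discovers all three pages, even though `/news.html` links to nothing; a real robot would hand each fetched page to the indexer as it goes.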

Robots and Spiders

Robots are also called spiders or crawlers. Most people use the terms Web index, search engine, and search service interchangeably to refer to a site or service that allows you to define a search query that will retrieve specific information online. In 2018 there were four primary search engines: Google, Bing, Yahoo, and DuckDuckGo. The search engines discussed below existed during the dotcom era and are no longer in use.
When people refer to sites such as AltaVista or Excite as search engines, they are not exactly correct. These sites were actually commercial services that provided an interface and a search engine (the software that actually searches the database) with which to search a database of Web documents (or portions of Web documents). Each commercial service had its own search-engine software and indexing robot. The combination of a robot-generated database and a search engine is also referred to as a Web index.
Although it may seem that a search engine will always overpower a directory through the sheer size of its automated database, there are a few limitations of individual search engines that you should know about: the percentage of all Web documents that are actually covered, the overlap between search engine services, and how they handle synonyms and homonyms.

As modern search engines evolved, they started to take into account the link profile of both a given page and its domain. They found out that the relationship between these two indicators was itself a very useful metric for ranking webpages.

Domain and Page Popularity

There are hundreds of factors that help engines decide how to rank a page. In general, those factors fall into two categories: 1) relevance and 2) popularity, or "authority." For the purposes of this demonstration, set relevance aside for a moment. Within the category of popularity, there are two primary types:
  1. domain popularity and
  2. page popularity.
Modern search engines rank pages by a combination of these two kinds of popularity metrics, which are measurements of link profiles. To rank number one for a given query, you need the highest amount of total popularity on the Internet. This becomes clear if you start looking for patterns in search result pages. Have you ever noticed that popular domains like Wikipedia.org tend to rank for everything? That is because they have an enormous amount of domain popularity.
Question: But what about those competitors who outrank me for a specific term with a practically unknown domain?
This happens when they have an excess of page popularity.
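The interplay between the two metrics can be sketched as a weighted blend. The weights, scores, and URLs below are all illustrative assumptions; real engines combine hundreds of signals, not two numbers.

```python
def rank(pages, domain_weight=0.6, page_weight=0.4):
    """Order pages by a weighted blend of domain and page popularity.

    The weights are illustrative; a real engine's blend is far more complex.
    """
    def score(p):
        return domain_weight * p["domain_pop"] + page_weight * p["page_pop"]
    return sorted(pages, key=score, reverse=True)

pages = [
    {"url": "wikipedia.org/topic", "domain_pop": 95, "page_pop": 20},
    {"url": "unknown-blog.com/deep-dive", "domain_pop": 10, "page_pop": 90},
]
```

With the default weights, the high-authority domain wins even though its page is less popular. Shift the weights toward page popularity (for example, `rank(pages, domain_weight=0.1, page_weight=0.9)`) and the practically unknown domain outranks it, which is the situation described in the question above.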

SEMrush Software