Google 101: How Search Engines Crawl, Index and Rank Pages

Google 101: How Search Engines Work

"I only browse this blog for the articles" should work for this one...

According to a recent comScore study, more than 12 billion Internet searches are performed each month. This equates to roughly 4,500 searches per second! With over 65 percent market share, Google handles 2,900 searches every second. It’s no wonder most SEO professionals address Google’s algorithm in their optimization efforts.
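For the curious, here is a quick back-of-the-envelope check of that arithmetic. It is only a sketch: the 30-day month and the function name are my assumptions, and the result lands in the same ballpark as the rounded figures above rather than matching them exactly.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def searches_per_second(searches_per_month: float, days_in_month: int = 30) -> float:
    """Average searches per second over a month of the given length."""
    return searches_per_month / (days_in_month * SECONDS_PER_DAY)

# 12 billion searches per month across all search engines (figure from the article).
total = searches_per_second(12_000_000_000)

# Google's portion at a 65 percent market share (figure from the article).
google = total * 0.65

print(f"All engines: {total:,.0f} per second, Google: {google:,.0f} per second")
```

Changing days_in_month to 31 nudges both numbers down slightly, which is why rounded figures like these vary from source to source.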

The good news is that Bing, Yahoo! and almost all other search engines have the same goals and apply very similar processes to deliver search results to users.

Note: For advanced marketers or those in extremely competitive niche markets, studying and addressing search engine-specific algorithms may be beneficial. But for most of you, the terms “Google” and “search engine” can be used interchangeably.

The Goal of Search Engines

The primary goal of a search engine company is to make money. Google, Yahoo! and Bing generate revenue mainly through pay-per-click (PPC) advertising. The second goal of search engines is to deliver their users the best results possible in terms of relevance and speed. Let me show you how Google accomplishes this.

How Search Engines Work: Three Steps to Search Results

1. Crawling

Think of the web as a series of individual documents (webpages) linked together by strands not unlike a spider’s web. Although you may often use the term “website,” incorporating all the pages within, search engines see and scan each individual page on the Web independently. A website is simply a collection of related webpages linked together, using hyperlinks. Without these links, the Internet would serve no purpose, denying us (and search engines) any way to find or rank information!

How Does Google Find My Site?

Google spiders, also known as “Googlebots,” crawl the entire World Wide Web, scanning each webpage (i.e., billions of documents) and exploring its hyperlinks, storing this data in one of several indexes. This process continues until the search engine spider has found, “read” and indexed virtually every page on the Internet! Therefore, a great way for Google to find your site is when it notices and explores links on other sites that point to yours.
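To make the idea of following links concrete, here is a minimal sketch of a crawler working through a tiny, made-up web held in memory. The URLs, the TOY_WEB dictionary and the crawl function are all hypothetical, and a real spider is vastly more sophisticated than this.

```python
from collections import deque

# Hypothetical mini-web: each URL maps to the links found on that page.
TOY_WEB = {
    "a.example/home": ["a.example/about", "b.example/home"],
    "a.example/about": ["a.example/home"],
    "b.example/home": ["c.example/home"],
    "c.example/home": [],
    "d.example/home": ["a.example/home"],  # no other page links here
}

def crawl(seed: str, web: dict[str, list[str]]) -> list[str]:
    """Breadth-first crawl: visit a page, queue its unseen links, repeat."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)  # "read" the page
        for link in web.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

crawled = crawl("a.example/home", TOY_WEB)
print(crawled)
# ['a.example/home', 'a.example/about', 'b.example/home', 'c.example/home']
```

Notice that d.example/home never shows up: nothing links to it, so this toy spider has no way to discover it. That is exactly why links from other sites matter.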

You can also submit your website to Google for crawling in a couple of different ways.

You can also control Googlebot, limiting or omitting specified content from being crawled and indexed, by using a robots.txt file. See our “SEO Articles” category for more on site maps and search engine submission.
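Here is a small sketch of how a robots.txt file is read, using Python's standard urllib.robotparser module. The rules and the example.com URLs are made up for illustration, and this shows how a well-behaved crawler could interpret the file, not how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a crawler identifying itself as Googlebot may fetch each URL.
allowed_public = parser.can_fetch("Googlebot", "https://www.example.com/index.html")
allowed_private = parser.can_fetch("Googlebot", "https://www.example.com/private/report.html")

print(allowed_public, allowed_private)  # True False
```

Anything under /private/ is off limits to Googlebot in this example, while the rest of the site stays open to crawling.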

 

2. Indexing

Once a webpage has been crawled, Google parses out and stores the code from these pages in massive data centers (Google’s index), ensuring that data can be served up in just a fraction of a second! Google assigns a unique ID to each webpage, and even indexes the content of each page to identify precisely which terms it contains.
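A common textbook way to picture this is an “inverted index,” which maps each term to the IDs of the pages containing it. The sketch below is a toy version under my own assumptions (the pages, the ID scheme and the build_index function are hypothetical) and says nothing about how Google actually stores its data.

```python
from collections import defaultdict

# Hypothetical crawled pages: URL -> page text.
PAGES = {
    "example.com/beach": "cheap beach vacations",
    "example.com/ski": "ski vacations and resorts",
    "example.com/pets": "pet supplies",
}

def build_index(pages: dict[str, str]):
    """Assign each page a unique ID and map every term to the IDs containing it."""
    doc_ids = {url: doc_id for doc_id, url in enumerate(sorted(pages), start=1)}
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(doc_ids[url])
    return doc_ids, index

doc_ids, index = build_index(PAGES)
print(sorted(index["vacations"]))  # [1, 3]
```

Looking up a term is now a single dictionary access instead of a scan through every page, which is the basic reason an index can answer so quickly.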

Stop Words

Google also uses “stop words” in its index to speed up search results. Stop words are generic or commonly used words that rarely help narrow a search. Common Google stop words include: I, a, about, an, and, are, as, at, be, by, for, from, how, in, it, is, of, on, or, that, the, this, to, was, what, when, where, who, will, with. This is important to note, as space is valuable in SEO and using words that search engines won’t “read” can consume extra space and offer little benefit.
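To see what is left once stop words are ignored, here is a minimal sketch that filters a phrase against the list above. The strip_stop_words function and the sample phrase are my own illustration.

```python
# Stop words taken from the list in this article.
STOP_WORDS = frozenset(
    """i a about an and are as at be by for from how in it is of on or
    that the this to was what when where who will with""".split()
)

def strip_stop_words(phrase: str) -> list[str]:
    """Return the words of a phrase that are not stop words, lowercased."""
    return [word for word in phrase.lower().split() if word not in STOP_WORDS]

terms = strip_stop_words("How to plan the best vacations in Europe")
print(terms)  # ['plan', 'best', 'vacations', 'europe']
```

Half of the eight words in that phrase are stop words, which shows how much room they can occupy.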

The Supplemental Index

Google maintains a supplemental index, used to store sites suspected of spam, sites with duplicate content and those that are hard to crawl (because of size or structure issues). It’s important to avoid inclusion in the supplemental index, as these sites are said to rank poorly in search results.

 

3. Ranking

Google’s Algorithm

Upon receiving a search query, Google must first return only those results related to the query, and, second, rank these results in order of relevance and importance. This ranking process is known as a search engine algorithm. Each major search engine company maintains its own algorithm with the goal of producing the “best” (most useful and relevant) results.

Think of the ranking process as a filter: a search engine’s mission in producing a search results page is to start with (almost literally) all the pages of data on the Web, then filter these pages through a series of screening steps, and, finally, narrow the list to what the search engine determines to be the “best” list of results.

Although complex in its entirety, Google’s search engine ranking algorithm boils down to a few simple concepts that cover the essence of how Google ranks webpages:

Relevance

The first test a webpage must pass to earn a ranking is relevance. For example, if someone searches for the term “vacations,” Google’s first task is to pull all webpages that include this exact term (referred to as a “keyword”). This is why On Page SEO (optimizing all the elements of a webpage) is the first step in an SEO campaign: Your ticket to even be considered in search results is relevant content!
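As a rough illustration of that first test, the sketch below keeps only the pages whose text contains the exact keyword. The pages and the relevant_pages function are hypothetical, and real relevance scoring considers far more than a simple word match.

```python
import re

# Hypothetical pages: URL -> page text.
PAGES = {
    "example.com/beach": "Affordable beach vacations for families",
    "example.com/ski": "Ski vacations in the Alps",
    "example.com/pets": "Pet supplies and toys",
    "example.com/plan": "Plan your next vacation",
}

def relevant_pages(keyword: str, pages: dict[str, str]) -> list[str]:
    """Return the URLs of pages whose text contains the exact keyword as a whole word."""
    pattern = re.compile(rf"\b{re.escape(keyword.lower())}\b")
    return sorted(url for url, text in pages.items() if pattern.search(text.lower()))

matches = relevant_pages("vacations", PAGES)
print(matches)  # ['example.com/beach', 'example.com/ski']
```

The page about planning a “vacation” (singular) misses the cut in this strict version, which is one reason keyword choice and placement get so much attention in On Page SEO.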

Note: you can go down this rabbit hole much more deeply and study Google’s algorithm in detail, but, for the purpose of describing relevance in this context, all you really need to understand is that the first step on the path to top search engine rankings is to have your desired keywords placed in the optimum locations in your pages. (See our small business SEO Articles for more information.)

Authority (also referred to as “Importance”)

Unlike relevance, authority is determined largely by back links (links residing on other webpages that point to the ranking page). You may be familiar with the terms “Google PageRank” or “PR.” Google’s PageRank system is, in essence, a measure of a page’s authority based on the links that point to it.

Authority (and, therefore, page rank) hinges upon links as a gauge of popularity. In simpler terms, Google’s authority algorithm functions by treating the Internet like a massive voting system/popularity contest, treating each link to your site like a vote: if website A’s subject matter is soccer and has 900 other soccer-related sites linking to it, and site B is also about soccer but has just 150 soccer-related links, then site A is judged to be the greater authority related to soccer and therefore ranks higher.
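The voting analogy can be sketched in a few lines by counting each link as one vote. The site names and link records below are invented to mirror the soccer example, and treating every link as an equal vote is a deliberate simplification.

```python
from collections import Counter

# Hypothetical link records: (linking site, linked-to site).
links = [(f"soccer-fan-{i}.example", "site-a.example") for i in range(900)]
links += [(f"soccer-blog-{i}.example", "site-b.example") for i in range(150)]

# Each inbound link counts as one vote for the site it points to.
votes = Counter(target for _source, target in links)

# Sites ordered from most votes to fewest.
ranking = [site for site, _count in votes.most_common()]
print(ranking)  # ['site-a.example', 'site-b.example']
```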

If you sell pet supplies and wish to obtain top search engine rankings for this term, you must obtain more back links than competing pages, and those links should come from other pet supply-related webpages (ideally, pages with more authority than your own). As you can see, there’s an aspect of relevance here as well: links from sites in the same “link neighborhood” carry more “link juice” than unrelated (off-topic) sites.
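Why would a link from a stronger page be worth more? One publicly described way to model this is the classic PageRank iteration, in which each page passes a share of its own score to the pages it links to. The sketch below is a simplified version under my own assumptions: the page names are invented, 0.85 is simply a commonly cited damping value, and none of this should be read as Google's current production algorithm.

```python
def pagerank(links: dict[str, list[str]], damping: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """Simplified PageRank. links maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over every page.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical pet-supply neighborhood: each shop has exactly one back link.
LINKS = {
    "blog-1.example": ["big-pet-portal.example"],
    "blog-2.example": ["big-pet-portal.example"],
    "blog-3.example": ["big-pet-portal.example"],
    "big-pet-portal.example": ["shop-a.example"],
    "tiny-blog.example": ["shop-b.example"],
}

ranks = pagerank(LINKS)
print(ranks["shop-a.example"] > ranks["shop-b.example"])  # True
```

Both shops have a single back link, yet shop-a comes out ahead because its link comes from a page that other pages link to.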

Trust

The third phase of Google’s ranking algorithm is Trust or “TrustRank.” Trust is measured largely by how consistent/reliable a website is in providing accurate information to users, placing higher value on more established sites.

Time is a component of the trust algorithm as well. This means websites that have been around longer get more trust points in Google’s algorithm. Lastly, there are minor but notable trust ranking factors such as site load time, W3C code validation and link quality.

Spam Filters

One of the primary reasons that search engine algorithms are always changing is the need to address “Black Hat SEO” practices, also called “search engine spamming.” These are techniques deployed by site owners who try to “game” the system by gaining search engine traffic without earning their rankings naturally, i.e., by building authority (quality back links) over time. Google is aggressive about such sites: excluding them from search results maintains quality and user satisfaction.

Note: there are several negative ranking factors that may have adverse effects on your search engine rankings without your knowing. These include: duplicate content, purchasing/selling links, and more, all of which are covered in our Geek-Free Marketing Blog (SEO and Search Engines section).

Summary

Search engines employ ever-changing algorithms to accomplish their mission of providing accurate and timely results to users, earning them millions in PPC ad revenue in the process.

Although complicated, these algorithms are on our side, as those website owners who seek shortcuts often experience only short-term traffic. As long as your website provides quality, relevant content and earns popularity and trust, long-term TRAFFIC is sure to follow!