We always talk about Google, but the software that actually crawls our sites is called Googlebot, and it plays a fundamental role in determining how our content is positioned in Google's SERPs.
In a nutshell, understanding how Googlebot works is the key to improving the visibility of any site. Let's first see what Googlebot is, and then walk step by step through how to optimize a website for it.
On a technical level, Googlebot is not a single server scanning the web: for scalability, its work is distributed across many machines around the world.
It is a spider that crawls web pages on the internet to discover new links to visit and subsequently index in the Google database, and that periodically checks whether the contents have been updated or changed.
Quick Site Optimizations for Googlebot
To facilitate Google's work while it crawls the site, a few inexpensive interventions can make a real difference. Let's look at the most interesting and, above all, most practical approaches, since a complete treatment would need a whole book.
Add a Sensible robots.txt File
In practice, the bot tries to read the robots.txt file in the root of each site to find directives that prevent the crawling of specific pages; this is precisely why having this file is an SEO best practice for any website.
For example, by inserting the following syntax in robots.txt, we tell any bot (Googlebot, Bingbot, DuckDuckBot, Baiduspider, YandexBot) not to crawl a private page.
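A minimal robots.txt sketch; the path /private-page.html is a hypothetical example:

```
# Applies to every crawler
User-agent: *
# Do not crawl this page (hypothetical path)
Disallow: /private-page.html
```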
On the other hand, we can also specify rules to prohibit access only to specific bots.
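For instance, to keep only Bingbot out of a hypothetical /archives/ directory while leaving other crawlers free, a sketch could be:

```
# Rule targeting a single crawler (directory is hypothetical)
User-agent: Bingbot
Disallow: /archives/
```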
Many think this does not affect their site, but it is always worth checking for low-value sections that waste the bot's resources (crawl budget), such as private areas, author archives, or others depending on the type of site under examination.
In these cases, we can exclude them, making the bot concentrate on far more important pages, with the result that Google's index stays more up to date on the areas of the site that matter most to us.
At this point, a question will spontaneously arise: is it possible to identify a crawler among all site visits?
The answer is yes. Since each spider uses different IPs, the most reliable identification method is based on the User-Agent, an identification string sent with every request.
Trivially, by reading the Apache logs we can find lines like the following and discover if and when a specific crawler visited our site.
188.8.131.52 - - [08/Jul/2019:23:40:23 +0200] "GET / HTTP/1.1" 200 1229 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
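As a sketch, a few lines of Python can scan an Apache access log for this User-Agent string. The file name access.log is an assumption, and note that the User-Agent can be spoofed, so a reverse-DNS lookup on the IP remains the definitive check:

```python
import re

# Matches the version token Googlebot declares in its User-Agent string.
GOOGLEBOT_RE = re.compile(r"Googlebot/\d+\.\d+")

def is_googlebot(log_line: str) -> bool:
    """Return True if the log line's User-Agent mentions Googlebot."""
    return bool(GOOGLEBOT_RE.search(log_line))

# Example: filter a combined-format access log (path is hypothetical).
# with open("access.log") as f:
#     googlebot_hits = [line for line in f if is_googlebot(line)]

line = ('188.8.131.52 - - [08/Jul/2019:23:40:23 +0200] "GET / HTTP/1.1" '
        '200 1229 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
        '+http://www.google.com/bot.html)"')
print(is_googlebot(line))  # True
```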
Add a sitemap.xml When It Makes Sense
Entire books have been written on the importance of the sitemap, but it must always be contextualized. The sitemap helps Googlebot by providing the site structure with all the URLs the webmaster considers relevant.
Without this file, the Big G crawler may not visit some pages, perhaps because they are not properly linked from other sections of the site. In light of this, a sitemap for a 5-page showcase site with direct links from the homepage would clearly be superfluous.
On the contrary, an e-commerce site or a blog with daily content would do well to have a sitemap.xml, and perhaps even a real HTML site map visible to human users, to help them orient themselves within the site's structure.
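A minimal sitemap.xml sketch for such a site, following the standard sitemap protocol; the domain and URLs are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2019-07-08</lastmod>
  </url>
  <url>
    <loc>https://www.example.com/blog/latest-post/</loc>
    <lastmod>2019-07-08</lastmod>
  </url>
</urlset>
```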
It should be stressed that the presence or absence of this .xml file is not in itself grounds for a penalty, but indirectly its absence can lead to weaker results.
Optimize the Structure of Internal Links
We have already touched on this, but it bears repeating: a well-designed site architecture is essential to guide Googlebot to every relevant page of the site.
In the various industry forums, we often hear that to have a good internal linking structure, each page must be reachable with a maximum of 3 clicks from the homepage.
Beyond the precise number, it is important to note how planning the site architecture is fundamental from the beginning, without leaving anything to chance.
In essence, this does not mean inserting hundreds of links on the homepage, because that would disperse the link juice, the value passed from one page to another.
On the contrary, it is an invitation to use categories or, in any case, a reasoned distribution of links across the various sections of the site, so that every single piece of content is referenced from other parts.
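As a sketch, a category page might reference its contents like this (all URLs are hypothetical), so that every article sits just two clicks from the homepage:

```html
<!-- Homepage navigation: links to category hubs -->
<nav>
  <a href="/category/seo/">SEO</a>
  <a href="/category/web-design/">Web Design</a>
</nav>

<!-- Inside /category/seo/: links to the individual articles -->
<ul>
  <li><a href="/category/seo/optimize-for-googlebot/">Optimize for Googlebot</a></li>
  <li><a href="/category/seo/crawl-budget-basics/">Crawl Budget Basics</a></li>
</ul>
```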
Use the Nofollow Attribute Sparingly
To preserve the crawl budget, in addition to working with the robots.txt file, we can simply add the “nofollow” attribute to links on our pages, as in the following example.
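A sketch of such a link; the target URL is a hypothetical private area:

```html
<!-- Googlebot is asked not to follow this link (hypothetical URL) -->
<a href="/private-area/" rel="nofollow">Reserved area</a>
```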
In this way, we are telling Googlebot not to follow the link to the target page, thus avoiding a waste of resources. On the other hand, we must be very careful, because we may run into partial indexing problems if important pages are linked only from the page we have just excluded.
For this reason, rel="nofollow" is fine for links to private areas or sections we absolutely want to keep out of Google's index, but we must be careful in e-commerce or blog sections, where real disasters could happen.
To conclude: although the crawling of a website is an often underestimated topic, it is important to intervene early with the various optimizations that facilitate Googlebot's work on our site.
In fact, while off-page and on-page SEO have always been the privileged activities of those involved in search engine optimization, without optimal crawlability the results of our work will still fall short.