Now that you’ve read What is Crawl Budget and When Should You Worry About It, you understand what crawl budget is and why it may be something your business needs to think about. The next question is how to optimise for it. In this second part of our 2-part blog series, we’ll offer some expert advice to help you do exactly that.
There are a number of aspects of your website’s performance and structure that can cause crawl budget issues. Generally speaking, there are 4 key ways in which you can maximise crawl budget for your website:
An XML sitemap is a file on your website that feeds search engines data on the pages of the site as well as the crawl priority or hierarchy of site content - in other words, it tells search engine crawlers what you want them to look at on your site.
Whilst having an XML sitemap does not guarantee that all your pages will get crawled and indexed, it does increase your chances, particularly if your site architecture and internal linking are not optimal.
A sitemap should be created at the root level of your site (usually mysite.com/sitemap.xml), and you should submit it to the major search engines, e.g. Google via Google Search Console and Bing via Bing Webmaster Tools.
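Most CMSs and SEO plugins will generate a sitemap for you, but as a minimal sketch (assuming a hand-maintained list of pages; the URLs and priorities below are placeholders), a basic sitemap can be built with Python’s standard library:

```python
# Minimal sketch: build a basic sitemap.xml using only the standard
# library. URLs and priority values here are illustrative placeholders.
import xml.etree.ElementTree as ET

def build_sitemap(urls):
    """Return a sitemap XML string for a list of (loc, priority) tuples."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for loc, priority in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "priority").text = str(priority)
    return ET.tostring(urlset, encoding="unicode")

pages = [("https://mysite.com/", 1.0), ("https://mysite.com/blog/", 0.8)]
print(build_sitemap(pages))
```

The output follows the sitemaps.org protocol, which is what Google Search Console and Bing Webmaster Tools expect when you submit a sitemap URL.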
This one is pretty self-explanatory. The faster your pages and links load, the faster a crawler can crawl and index them. By reducing the delay between Googlebot visiting one page and moving on to the next, you increase the chances of it crawling and indexing more pages.
Ensure that your site content loads quickly by:
When it comes to site performance, your web hosting service makes a huge difference. It can be tempting to go with the cheapest possible option for web hosting, particularly for a new website, but don’t forget to upgrade as your site begins to get more traffic.
Whilst a basic shared hosting package may be fine to start out with, once you start generating a lot of traffic to your site, you will want to consider a more robust hosting option like VPS or dedicated hosting.
Poorly optimised images will tank your site’s page load speed. Oversized images will often be scaled down by the browser, which is bad for performance: it takes extra CPU time and the user ends up downloading data they don’t use.
You should set an upper file-size limit for images on your site - the sweet spot varies by site type (e.g. e-commerce sites will want to prioritise image quality a little more), but here at adaptive we recommend an upper target of 200 KB.
All images should be compressed and optimised for web before uploading to your site. Photoshop and other common photo editing tools have "Save for web" options or similar.
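To find images that break your size budget after upload, a simple audit script can help. The sketch below (a hypothetical helper; the extension list and 200 KB limit follow the guideline above) walks a local copy of your site’s assets and flags oversized files:

```python
# Hypothetical helper: flag image files above a size budget
# (200 KB by default, matching the guideline above).
import os

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}

def oversized_images(root, limit_bytes=200 * 1024):
    """Return paths of image files under `root` larger than `limit_bytes`."""
    flagged = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                path = os.path.join(dirpath, name)
                if os.path.getsize(path) > limit_bytes:
                    flagged.append(path)
    return flagged
```

Running this over your uploads folder gives you a to-do list of images to re-compress before they hurt page load speed.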
Using a Content Delivery Network (CDN) is a good option, particularly if you have a lot of traffic from a wide geographical range - a CDN is essentially a global network of servers on which you can cache your site content. When a user requests files from your site, that request will be routed to the closest server.
An HTTP request is made each time an element on your page is downloaded, e.g. an image, stylesheet, script, etc.
Some CMSs and themes are notorious for bloating your page templates with scripts. Each additional tracking tool, 3rd-party integration or other “plugin” adds to this bloat, and also adds to the number of HTTP requests your site needs to make each time a user views a page.
Make sure that your site’s code is as lightweight and efficient as possible, and only add bulk (e.g. 3rd-party plugins) where it adds significant value to your users.
You can reduce bloat by minifying and combining files where appropriate.
Minifying files involves optimisations like removing unnecessary commas, spaces, other characters, code comments, stale code and formatting.
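As a toy illustration of what minification does (real minifiers such as those built into most build tools go much further), here is a tiny CSS minifier that strips comments and collapses whitespace:

```python
# Toy illustration of minification: strip CSS comments and collapse
# whitespace. Production minifiers perform many more optimisations.
import re

def minify_css(css):
    css = re.sub(r"/\*.*?\*/", "", css, flags=re.S)   # drop comments
    css = re.sub(r"\s+", " ", css)                    # collapse whitespace
    css = re.sub(r"\s*([{}:;,])\s*", r"\1", css)      # trim around punctuation
    return css.strip()

print(minify_css("body {\n  color: red;  /* brand colour */\n}"))
# -> body{color:red;}
```

The same bytes of meaning are delivered in fewer bytes on the wire, which is the whole point of minification.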
Asynchronous loading involves downloading and applying a page resource in the background, independently of other resources, rather than one at a time in the order they appear on the page.
Loading files asynchronously can improve page speed by letting the browser continue rendering your page from top to bottom instead of pausing each time it hits a blocking resource.
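For scripts, this is typically done with the async or defer attributes (a generic illustration; the file names are placeholders, and the right attribute depends on whether the script must run in document order):

```html
<!-- async: download in parallel, execute as soon as the file is ready -->
<script src="analytics.js" async></script>

<!-- defer: download in parallel, execute after the document is parsed -->
<script src="carousel.js" defer></script>
```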
When a user visits your site, their browser cache will save files, including images and HTML, that are necessary to display your site. This means that the next time that user visits, their browser can load the page without having to send another HTTP request to the server for those files.
There are different ways to set up caching depending on how your website has been built.
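As one illustration, on an Apache server browser caching can be configured with Cache-Control response headers, e.g. via mod_headers in your .htaccess file (the values below are a generic sketch; check your host’s documentation for what applies to your setup):

```apache
# Cache static assets for 30 days (illustrative values only)
<IfModule mod_headers.c>
  <FilesMatch "\.(jpg|jpeg|png|gif|webp|css|js)$">
    Header set Cache-Control "max-age=2592000, public"
  </FilesMatch>
</IfModule>
```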
Lazy loading is the practice of delaying load or initialization of resources or objects until they’re actually needed e.g. content that is “below the fold”.
By lazy loading below-the-fold content, the user does not need to wait for it before accessing the page, and initial page load times are significantly reduced.
Again, lazy loading may be implemented in different ways depending on how your site has been built.
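The simplest option in modern browsers is the native loading attribute on images (file name and dimensions below are placeholders):

```html
<!-- Native browser lazy loading for a below-the-fold image -->
<img src="product-photo.jpg" alt="Product photo"
     loading="lazy" width="600" height="400">
```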
Redirecting one URL to another is appropriate in many situations. However, if redirects are done incorrectly, it can lead to disastrous results. Two common examples of improper redirect usage are redirect chains and loops.
Long redirect chains and infinite loops cause crawlers to waste a lot of crawl budget and, eventually, to give up on crawling that chain altogether (Google says it will give up after 5 redirects in a row).
Review your site for redirects with a tool like Screaming Frog and reduce redirect chains by pointing every link in each chain directly at the final destination URL.
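Once you have exported your redirects (e.g. from a crawl tool), flattening chains is a mechanical job. The sketch below (URLs are hypothetical) resolves each source URL to its final destination and flags loops:

```python
# Sketch: given a map of redirects {from_url: to_url}, resolve each
# source URL to its final destination and flag loops, so every link
# can point directly at the end of its chain. URLs are hypothetical.
def resolve_redirects(redirects):
    resolved = {}
    for start in redirects:
        seen = {start}
        current = start
        while current in redirects:
            current = redirects[current]
            if current in seen:        # redirect loop detected
                current = None
                break
            seen.add(current)
        resolved[start] = current      # None marks a loop
    return resolved

chain = {"/a": "/b", "/b": "/c", "/x": "/y", "/y": "/x"}
print(resolve_redirects(chain))
```

Here /a and /b both resolve straight to /c, while /x and /y form a loop that should be fixed rather than flattened.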
Googlebot and other search engine crawlers find new content by following HTML links from pages they have already crawled on your site. You can make it easier for Google to quickly and efficiently find and crawl all of your pages by ensuring that you have:
A logical site structure has many benefits for SEO. It helps Google find all of your pages; it helps to spread link equity (page authority) throughout your site; and, most importantly in the context of this article, it ensures that crawlers can efficiently find all of the pages on your site without working too hard.
There are 5 steps to ensuring a logical site architecture:
Internal links are links that go from one page on a domain to a different page on the same domain. This includes:
A. Main navigation links (e.g. links in your main menu)
B. Secondary or other navigation links (e.g. sidebar menu links within certain sections of your site)
C. Hyperlinks within your content
Rather than relying on linking related content with hyperlinks alone (C), you should ensure that:
Websites with a lot of content often use pagination so that they can quickly and easily provide content to users. For example, a category landing page or a product list page may be split out into multiple pages, each with a manageable amount of content or links (e.g. links to products or blog posts).
Issues arise with pagination when key content is difficult for robots to reach e.g. when:
You can avoid paginated-related crawling issues by:
Google does not look at “pages” in the same way as many users do. Googlebot looks at URLs and, unless you tell it to behave otherwise, it will treat each unique URL it finds as a unique page with content worthy of crawling and indexing. However, many CMSs create a lot of unnecessary URLs that do not contain unique, useful content for the end user. These “unnecessary URLs” should be minimised. They come in many forms, including:
Navigation filter URLs are designed to narrow down items within a site’s listing page and display this “filtered” information to users. For example, in an e-commerce site, a user can apply a colour filter to a product listing page, which will narrow down the list of products on the page while appending parameters to the base URL.
If not managed correctly, this type of filtering functionality can cause serious crawl budget issues by creating near duplicates of important pages. If these filtered URLs are accessible to robots, and particularly if those filters can be combined, then crawlers will keep finding endless new filter combinations, leading to thousands more pages to crawl. Ultimately, this causes bots to waste crawl budget that should be reserved for genuine, unique URLs.
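The combinatorial maths here is unforgiving. As an illustration (the filter names and values below are hypothetical), just three modest filters turn a single category page into dozens of crawlable URL variants:

```python
# Illustration of how filter parameters multiply crawlable URLs:
# three small filters turn ONE category page into 60 URL variants.
# Filter names and values are hypothetical.
from itertools import product

filters = {
    "colour": ["red", "blue", "green", ""],     # "" = filter not applied
    "size":   ["s", "m", "l", "xl", ""],
    "brand":  ["acme", "globex", ""],
}

combinations = list(product(*filters.values()))
print(len(combinations))  # 4 * 5 * 3 = 60 variants for a single page
```

Multiply that by every category page on a large store and it is easy to see how crawl budget disappears into filter URLs.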
Adding a canonical tag can stop the indexing of a page with a dynamic URL, but it will not stop it from being crawled. Therefore, it is best practice to ensure that URLs with filters that do not result in unique content with organic traffic potential are tagged with a nofollow attribute. How you do this will depend on your CMS and other site details, so speak to your developer to identify the best approach for you. Specific filter parameters could also historically be blocked via the URL Parameters tool in Google Search Console, although Google has since deprecated that tool.
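In markup, the two measures look like this (paths and the mysite.com domain are placeholders):

```html
<!-- On a filtered URL such as /shoes?colour=red, point the
     canonical at the unfiltered page to consolidate indexing -->
<link rel="canonical" href="https://mysite.com/shoes/">

<!-- Discourage crawlers from following filter links at all -->
<a href="/shoes?colour=red" rel="nofollow">Red</a>
```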
Tracking parameters (e.g. UTM parameters) are short strings of text that you add to URLs (e.g. in adverts or affiliate links) to send data on their usage back to 3rd-party tools, e.g. to help you track the performance of a specific ad campaign in Google Analytics.
If you use a lot of different tracking parameter URLs (and combinations) to drive traffic to your site, then you will end up with what Google interprets as multiple different versions of the same page. In the worst case scenario Google will see all of these as individual pages that are worthy of crawling and indexing.
Again, in this scenario, you should ensure that URLs with tracking parameters that do not result in unique content with organic traffic potential are tagged with a nofollow attribute. Specific parameters could also historically be blocked via the URL Parameters tool in Google Search Console, although Google has since deprecated that tool.
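When auditing which URLs Google may be treating as duplicates, it helps to normalise them first. The sketch below (the example URL is hypothetical) strips utm_* parameters to recover the clean, canonical form of a URL:

```python
# Sketch: derive the clean form of a URL by stripping common tracking
# parameters (utm_*), e.g. when auditing crawl logs for duplicates.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def strip_tracking_params(url):
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if not k.lower().startswith("utm_")]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(strip_tracking_params(
    "https://mysite.com/page?utm_source=ad&utm_campaign=spring&ref=1"))
# -> https://mysite.com/page?ref=1
```

Grouping crawl-log URLs by their stripped form quickly shows how many parameter variants of each “real” page the bots are actually fetching.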
While Google may downplay the significance of crawl budget for most site owners, in our experience it is something that you should monitor, especially if your business maintains an e-commerce store or another type of large and potentially complicated website.
To ensure that your site’s indexability is not negatively affected by crawl budget issues, you should ensure that it is fast and logically structured, with comprehensive internal linking, and that it is not creating excessive numbers of indexable duplicate pages via dynamic URLs.
If you suspect that Google is finding it difficult to find, crawl and rank some of your content, feel free to get in touch to discuss an SEO audit.