If you submit a sitemap to Google (via Google Search Console) and you have a robots.txt file that conflicts with that sitemap, Google's crawler will report errors.
Hugo sites automatically generate a sitemap.xml file that includes all site pages. I highly recommend submitting the link to your sitemap to Google to inform the crawler of your pages. The URL will be https://mysite.com/sitemap.xml, where "mysite.com" is your site's domain name.
In some cases, you may want to prevent some pages from being crawled. In my example, I am using a theme that creates additional URLs for the same pages based on categories and tags, and I don't want these URLs added to the Google index.
Adding a robots.txt file to your site is very easy. The Hugo documentation does a good job of covering this, but in summary, there are two things to set up:
To enable robots.txt, add the following to your site configuration file (this is the TOML example).
enableRobotsTXT = true
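If your site configuration file uses YAML rather than TOML, the equivalent setting would look like this:

enableRobotsTXT: true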
This setting tells Hugo to look for a robots.txt file in your layouts folder; if it finds one, Hugo will add it to your site. If a robots.txt file is not found, Hugo will generate one.
This is what the default robots.txt file looks like.
User-agent: *
This robots.txt file is set up to allow all crawlers on all pages. For my example, I want to block crawling of pages under the tags and categories paths. My robots.txt file looks like this:
User-agent: *
Disallow: /tags/
Disallow: /categories/
To include a site-specific robots.txt file in your Hugo site, add your custom robots.txt file to your site's "layouts" folder. Hugo will pick it up and place it in the public folder when you run a build. The public folder is the build output location for all the files in your static site.
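As a rough sketch (the site folder name is just a placeholder), the relevant files end up in these locations:

my-hugo-site/
├── layouts/
│   └── robots.txt      <- your custom robots.txt goes here
└── public/
    └── robots.txt      <- copied here when you run a build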
When I deployed this new robots.txt file, Google Search Console raised an error because my sitemap.xml had references to pages under the tags and categories paths. Paths blocked by robots.txt should not be included in the sitemap.xml file. The solution is to customize the sitemap.xml file to exclude these paths.
To customize the sitemap that Hugo generates, you need to add the Hugo sitemap template to your layouts folder (just like the custom robots.txt). You can get the Hugo sitemap template here.
Both the robots.txt and sitemap.xml files are run through the Hugo templating engine during the build, so you can add template code to them. To exclude the tags and categories paths from the sitemap, I set up my sitemap.xml in my site's layouts folder as follows.
{{ printf "<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"yes\" ?>" | safeHTML }}
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xhtml="http://www.w3.org/1999/xhtml">
  {{ range .Data.Pages }}
  {{ if not (or (hasPrefix .RelPermalink "/tags") (hasPrefix .RelPermalink "/categories")) }}
  <url>
    <loc>{{ .Permalink }}</loc>{{ if not .Lastmod.IsZero }}
    <lastmod>{{ safeHTML ( .Lastmod.Format "2006-01-02T15:04:05-07:00" ) }}</lastmod>{{ end }}{{ with .Sitemap.ChangeFreq }}
    <changefreq>{{ . }}</changefreq>{{ end }}{{ if ge .Sitemap.Priority 0.0 }}
    <priority>{{ .Sitemap.Priority }}</priority>{{ end }}{{ if .IsTranslated }}{{ range .Translations }}
    <xhtml:link
      rel="alternate"
      hreflang="{{ .Lang }}"
      href="{{ .Permalink }}"
      />{{ end }}
    <xhtml:link
      rel="alternate"
      hreflang="{{ .Lang }}"
      href="{{ .Permalink }}"
      />{{ end }}
  </url>
  {{ end }}
  {{ end }}
</urlset>
I've taken the Hugo sitemap template and added logic that, as it iterates over the pages, excludes URLs beginning with "/tags" or "/categories". Now my sitemap is aligned with my robots.txt, and there are no more Google Search Console errors.
This is the new if line I added to the template to exclude the tags and categories paths. Don't forget to close the "if" with an "end".
{{ if not (or (hasPrefix .RelPermalink "/tags") (hasPrefix .RelPermalink "/categories")) }}
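To confirm the change worked, you can rebuild the site and check the generated sitemap for the excluded paths. A quick check from the command line might look like this (assuming public is your build output folder):

hugo
grep -c "/tags/" public/sitemap.xml        # should print 0
grep -c "/categories/" public/sitemap.xml  # should print 0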
If you add a robots.txt file to your site and want to avoid errors in Google Search Console for your submitted sitemap.xml, make sure the sitemap does not include pages that are blocked by robots.txt.
Hugo allows you to override the default sitemap generator by providing your own template and adjusting it.
For more posts on Hugo see:
Photo by Vivienne Nieuwenhuizen on Unsplash