SEO
1655196001

Beginner's Guide to Sitemap.xml

An XML-Sitemap is a file for search engine bots that lists the pages of a site in XML format. It is designed to help search engines crawl and index your site more efficiently.

Do not confuse an XML-Sitemap with an HTML sitemap, which is intended for website users.

What are the types of XML-Sitemap

There are two types of sitemaps:

  • a regular sitemap – a file that holds up to 50,000 URLs and is no larger than 50 MB (uncompressed);
  • a sitemap index file – a file that combines several regular sitemaps. It is created for large or multilingual sites whose map would exceed the limits of a regular sitemap. An index file must also be no larger than 50 MB and reference no more than 50,000 sitemaps.

How to find an XML-Sitemap

There are several ways to locate a site's sitemap:

  • In the robots.txt file. Type the following into the address bar: https://site.com/robots.txt
    The file may contain a Sitemap directive in the following format:
Sitemap: https://site.com/sitemap.xml
  • If you couldn't find a link to the file in robots.txt, type the following into the address bar: https://site.com/sitemap.xml
    While the URL of the robots.txt file is fixed (/robots.txt), the URL of the sitemap file is not: it can be anything.

/sitemap.xml is simply the most common XML sitemap name, but it may be different (e.g. /sitemap-categories.xml, /sitemap-en.xml, etc.).

  • You can also query a search engine using search operators. Combine two operators:
    • site: – an operator that restricts the search to the specified site;
    • filetype: – an operator that restricts the search to the file type you need.

To search for XML files on a site, form a query like this:

site:site.com filetype:xml

What are the elements that make up an XML-Sitemap

A sitemap file can be either a regular sitemap or an index file, so below we look at the elements each type consists of.

What elements a regular sitemap consists of

Mandatory elements:

  • The first line specifies the XML version and the obligatory encoding for sitemap files – UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

  • <urlset> – the root tag, which declares the protocol standard (namespace) in use. It is the parent of the tags listed below;
  • <url> – a tag for each URL entry. It is the parent of the tags below and a child of <urlset>;
  • <loc> – a tag indicating the exact URL of the page. It is a child of <url>.

Optional elements:

  • <lastmod> – a tag that indicates the date the page was last updated. It is a child of <url>. Please note that Google takes the value of this tag into account only if it matches the actual time of the last page update. The date must be written in W3C Datetime format, which allows either a date alone (YYYY-MM-DD) or a full date with hours, minutes, seconds, and time zone (YYYY-MM-DDThh:mm:ssTZD). For example: 2022-05-16T19:20:30+03:00;
  • <changefreq> – a tag that indicates approximately how often the page changes. Valid values: always, hourly, daily, weekly, monthly, yearly, never;
  • <priority> – a tag that indicates the priority of a page relative to other pages on the site. The value is between 0.0 and 1.0.

According to Google Search Central, the search engine ignores the values of the <changefreq> and <priority> tags.

Below is an example of a sitemap in XML format:
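
The following is a minimal sketch rather than a file from a real site: the page URLs and dates are hypothetical placeholders on the site.com domain used throughout this guide. It shows the mandatory <urlset>, <url> and <loc> tags together with the optional ones described above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch: all URLs and dates below are hypothetical. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- Mandatory: the exact URL of the page -->
    <loc>https://site.com/</loc>
    <!-- Optional: full W3C Datetime with time zone -->
    <lastmod>2022-05-16T19:20:30+03:00</lastmod>
    <!-- Optional: per the text above, Google ignores these two values -->
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://site.com/blog/</loc>
    <!-- W3C Datetime also allows a date without a time -->
    <lastmod>2022-05-10</lastmod>
  </url>
</urlset>
```

The xmlns value on <urlset> is the namespace of the sitemap protocol, which is what "specifies the standard of the current protocol" refers to.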

What elements a sitemap index file consists of

Mandatory elements:

  • The first line specifies the XML version and the obligatory encoding for sitemap files – UTF-8;
  • <sitemapindex> – a mandatory parent tag in relation to all of the following. Indicates the standard of the current protocol;
  • <sitemap> – a mandatory tag that contains information about each sitemap file that is part of the index sitemap. It is a child tag to <sitemapindex>;
  • <loc> – a mandatory tag, it shows the location of the sitemap file. It is a child tag to <sitemap>;

Optional elements:

  • <lastmod> is an optional tag that indicates the last date when the sitemap file was updated (but not the individual pages listed in the file itself). It is a child tag to <sitemap>.

Below is an example of an XML-sitemap index file:
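
A minimal sketch of such an index, assuming two child sitemaps whose file names are borrowed from the naming examples earlier in this guide. The URLs and the date are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch: the child sitemap URLs and the date are hypothetical. -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <!-- Mandatory: location of a child sitemap file -->
    <loc>https://site.com/sitemap-categories.xml</loc>
    <!-- Optional: when the child sitemap file itself was last updated -->
    <lastmod>2022-05-16T19:20:30+03:00</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://site.com/sitemap-en.xml</loc>
  </sitemap>
</sitemapindex>
```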

How to create an XML-Sitemap

There are several ways to create an XML sitemap:

  • Using a content management system (CMS). Systems such as WordPress or Wix can generate a sitemap that is accessible to search engines. Check how the CMS you are using generates a sitemap: the process may be automatic, or you may need to perform some steps yourself;
  • Manually. If your site is small, you can build a sitemap yourself, using a text editor and following the syntax standards;
  • Using sitemap generators. There are many services capable of generating sitemaps.

The number of such generators is very large, so you can find one that you are comfortable with.

  1. Scan the required number of URLs.
  2. Open the tool "Sitemap Generator".
  3. Configure the settings you need.
  4. Click “Generate…” and select the path for saving the Sitemap files.

Google general information and recommendations for XML-Sitemap

  1. Google scans the URLs that you specify. So make sure the URLs are correct and accurate.
  2. All URLs you specify in a sitemap must refer to the domain for which the sitemap was created. Do not specify any other domain/subdomain.
  3. A sitemap can be placed anywhere on the site, but it only covers the directory it is in and the directories below it. Therefore, you should place the XML-Sitemap in the root directory of the site.
  4. A link to a regular sitemap file or to an index file can be specified in robots.txt as follows: Sitemap: https://site.com/sitemap.xml
  5. Sitemap files must be saved in UTF-8 encoding, and the URLs in them should contain only ASCII characters.
  6. If page addresses include non-ASCII characters, they need to be escaped. This usually happens automatically if you do not create page addresses manually. If URLs containing non-ASCII characters are not properly encoded and escaped, Google may warn you when you add your sitemap that pages from your XML-Sitemap have not been detected.
  7. Google does not guarantee that it will crawl every URL listed in the sitemap. The file only helps the system determine which pages you consider important.
  8. Google ignores the order of URLs in the sitemap.
  9. An XML-Sitemap file must contain no more than 50,000 URLs and be no larger than 50 MB. If these limits are exceeded, create a sitemap index file that references multiple sitemap files.
  10. Include in an XML-Sitemap only canonical pages that are open for crawling and indexing and return a 200 response code, excluding pagination pages.
  11. All URLs specified in the XML-Sitemap must be open for crawling in robots.txt and must not be blocked from indexing by the X-Robots-Tag header or a "noindex" meta tag.
  12. The sitemap should be updated automatically whenever listed pages are added, removed, or opened to or hidden from indexing.

Bing general information and recommendations for XML-Sitemap files

The Bing search engine does not describe radically different requirements for XML-Sitemap, only paraphrasing some of the standards listed in the Google guidelines. Therefore, we can conclude that by following Google's standards, we create a universal XML-Sitemap for Bing as well.

How to create an XML-sitemap for multilingual sites

There are 3 main ways to tell a search engine that the language versions of a page are not duplicates:

  • rel="alternate" hreflang="x" elements in the page code – the most common way;
  • with an XML-Sitemap;
  • with HTTP headers.

In the vast majority of cases, rel="alternate" hreflang="x" elements in the page code are enough to mark up a multilingual site.

If you make a sitemap for a large site, you can additionally specify the language versions in the XML-Sitemap.

To specify alternative language versions of a page in the XML-Sitemap, you need to:

  • declare the namespace in the <urlset> tag: xmlns:xhtml="http://www.w3.org/1999/xhtml"
  • within the <url> tag, below the <loc> tag that holds the page URL, add an <xhtml:link> tag for each language version of the page, with the attributes rel="alternate", hreflang="x" (the language code of that version), and href (the URL of that version).

For example, the page has three language versions: Russian, Ukrainian and English. URLs for language versions of this page look like this:

  • https://site.com/ru/
  • https://site.com/ua/
  • https://site.com/en/

In XML-Sitemap multilingual versions of the page will look like this:
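
A minimal sketch for the three URLs listed above. One detail here is an assumption about the site: the hreflang values are language codes, so the Ukrainian version is marked as "uk" even though its path is /ua/. Each <url> entry lists every language version, including itself.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch based on the three example URLs from the text. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://site.com/ru/</loc>
    <xhtml:link rel="alternate" hreflang="ru" href="https://site.com/ru/"/>
    <xhtml:link rel="alternate" hreflang="uk" href="https://site.com/ua/"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://site.com/en/"/>
  </url>
  <url>
    <loc>https://site.com/ua/</loc>
    <xhtml:link rel="alternate" hreflang="ru" href="https://site.com/ru/"/>
    <xhtml:link rel="alternate" hreflang="uk" href="https://site.com/ua/"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://site.com/en/"/>
  </url>
  <url>
    <loc>https://site.com/en/</loc>
    <xhtml:link rel="alternate" hreflang="ru" href="https://site.com/ru/"/>
    <xhtml:link rel="alternate" hreflang="uk" href="https://site.com/ua/"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://site.com/en/"/>
  </url>
</urlset>
```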

Image xml-sitemaps

In some cases, the search engine cannot find images on the site. For example, when an image is loaded using JavaScript. There are two ways to point search engine crawlers to images:

  • Provide links to them in a regular XML-Sitemap.
  • Create a separate sitemap for images.

In both cases, you must specify the XML namespace in which you specify the tags for the images:

xmlns:image="http://www.google.com/schemas/sitemap-image/1.1"

The following image tags are mandatory within the <url> tag:

  • <image:image> – contains all information about one image. You can specify up to 1,000 images per page.
  • <image:loc> – the location of the image file. In some cases the image URL may be on a different domain than the main site. For such content to be crawled correctly, both domains must be verified in Google Search Console.

Image sitemaps may also contain optional tags which, according to Google Search Central, are ignored by the search engine, namely:

  • <image:caption> – image caption;
  • <image:geo_location> – location (country, city, etc.);
  • <image:title> – title of the image;
  • <image:license> – image license URL.

In addition to the tags above, an image sitemap must meet the following requirements:

  • the encoding used is UTF-8;
  • an XML-Sitemap for images must contain no more than 50,000 URLs and be no larger than 50 MB. If it exceeds these limits, create a sitemap index file that references several sitemaps;
  • this type of sitemap should contain only canonical pages that are open for crawling and indexing and return a 200 status code;
  • each URL has no more than 1,000 images;
  • an XML-Sitemap for images should contain only full-sized images, without thumbnails;
  • a link to the XML-Sitemap for images or to an index file should be placed in robots.txt;
  • an XML-Sitemap for images should be updated automatically on a regular basis.

An example of XML-Sitemap for images with one page and two images:
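
A minimal sketch, assuming a hypothetical page and two hypothetical image files on site.com. Only the mandatory image tags are used, since the optional ones are described above as ignored.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch: the page and image URLs are hypothetical. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <!-- The page on which both images appear -->
    <loc>https://site.com/page.html</loc>
    <image:image>
      <image:loc>https://site.com/images/photo-1.jpg</image:loc>
    </image:image>
    <image:image>
      <image:loc>https://site.com/images/photo-2.jpg</image:loc>
    </image:image>
  </url>
</urlset>
```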

Video XML-Sitemap

A Video XML-Sitemap is a way to let the search engine know that there are videos on the page, especially if they have recently been added or are hard to find. It's an important part of search engine optimization, especially if you want your videos to show up in search results.

Google general information and recommendations for Video XML-Sitemap files:

  1. The encoding used is UTF-8.
  2. Each sitemap file for video can contain up to 50,000 video elements, and must be no larger than 50MB. If you exceed that size, you can create an index file for the video sitemap, which will contain regular XML-Sitemaps for the video.
  3. You can create a separate XML-Sitemap for videos, or you can insert the video information into a regular sitemap.
  4. You are allowed to include multiple videos from the same page.
  5. Do not include information about the video that is not related to the main content of the page. Otherwise the video might not make it to the search engine index.
  6. Googlebot ignores the entry in the Sitemap file if no video was found at the specified URL.
  7. Creating an XML-Sitemap for videos does not guarantee indexing of the files. 
  8. Specified pages must be canonical, open for indexing and scanning, returning a 200 response code.
  9. Googlebot must have access to both the video file and the player. You should not put them on pages that require authorization, forbid them in robots.txt or block them in other ways.
  10. Place a link to the XML video sitemap or index file in robots.txt.
  11. XML-Sitemap for videos should be automatically updated on a regular basis.

Let's look at the elements that make up an XML-Sitemap for video.

First, specify the namespace in which the video tags are defined:

xmlns:video="http://www.google.com/schemas/sitemap-video/1.1"

Also, when you create a sitemap of this type, you must specify the following mandatory tags:

  • <urlset> is a tag which specifies the standard of the current protocol. It is parent to the tags listed below;
  • <url> – tag for each URL entry. It is parent to the tags below and a child tag to <urlset>;
  • <loc> – tag indicating the exact URL of the page. It is a child tag to <url>;
  • <video:video> – the parent tag for all information about one video;
  • <video:thumbnail_loc> – a link to the video thumbnail image;
  • <video:title> – the title of the video;
  • <video:description> – a description of the video, up to 2,048 characters;
  • <video:content_loc> – the actual location of the video file;
  • <video:player_loc> – a link to the player for the video. Google requires at least one of <video:content_loc> and <video:player_loc>.

You can also specify recommended tags:

  • <video:duration> – video duration in seconds, from 1 to 28800 (8 hours) inclusive;
  • <video:expiration_date> – the date and time after which the video will no longer be available. Omit this tag if the video does not expire; otherwise it will drop out of Google search after that date. The date should be specified in W3C format.

Optional XML-Sitemap tags for videos:

  • <video:rating> – rating of a video. Specified in the range from 0.0 to 5.0;
  • <video:view_count> – number of views of the video;
  • <video:publication_date> – date of video publication in W3C format;
  • <video:family_friendly> – indicates whether the video is available with SafeSearch enabled. If the tag is omitted, the video will be available in safe searches. Valid values are "yes" or "no";
  • <video:restriction> – a tag that allows or denies showing the video in certain countries. If the tag is absent, the video is allowed in all countries. To restrict a video, list the country codes in the tag and set its "relationship" attribute to allow or deny. Here is an example that blocks a video from Canadian search results:
<video:restriction relationship="deny">CA</video:restriction>
  • <video:requires_subscription> – shows if a subscription is required to view the video. The valid values are "yes" or "no";
  • <video:uploader> – the name of the user who uploaded the video. You may specify only one uploader name for one video. Also in the info attribute (optional) you can specify a link to the uploader information;
  • <video:live> – specifies if video is live. Valid values are "yes" or "no";
  • <video:tag> – tags for the video. You can specify up to 32 tags per video.

It is worth mentioning that Google has listed the tags and attributes it no longer takes into account: <video:category>, <video:gallery_loc>, <video:price>, <video:tvshow>, and the autoplay and allow_embed attributes of the <video:player_loc> tag.

An example of what a sitemap for a video can look like:
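
A minimal sketch with one hypothetical video. The page, thumbnail and file URLs, the title and the description are placeholders. It uses <video:content_loc> without <video:player_loc>, on the assumption that the video is a directly hosted file, and adds a few of the recommended and optional tags from the lists above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch: all URLs, texts and dates are hypothetical. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <!-- The page on which the video is embedded -->
    <loc>https://site.com/videos/how-to.html</loc>
    <video:video>
      <video:thumbnail_loc>https://site.com/thumbs/how-to.jpg</video:thumbnail_loc>
      <video:title>Example video title</video:title>
      <video:description>A short description of what the video shows.</video:description>
      <video:content_loc>https://site.com/media/how-to.mp4</video:content_loc>
      <!-- Recommended: duration in seconds (here, 10 minutes) -->
      <video:duration>600</video:duration>
      <!-- Optional tags -->
      <video:publication_date>2022-05-16T19:20:30+03:00</video:publication_date>
      <video:family_friendly>yes</video:family_friendly>
      <video:restriction relationship="deny">CA</video:restriction>
      <video:requires_subscription>no</video:requires_subscription>
      <video:live>no</video:live>
    </video:video>
  </url>
</urlset>
```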

Google News sitemap

For news sites, you can create a separate map with dynamic generation and daily updates. These files will only work for resources included in Google News lists. If the site is not on the list, you can submit a request to add it.

The sitemap file should only contain the URLs of articles published in the last two days. Articles published more than two days ago can be removed from the file and will remain in the Google News index for 30 days.

This sitemap can contain a maximum of 1,000 URLs. The reason for this limit is that XML-Sitemap files for Google News are crawled more frequently than regular sitemaps, so the limit helps the search engine avoid an excessive load. If the site publishes more than that in two days, you can create a sitemap index file for multiple sitemaps.

Google recommends updating your XML-Sitemap for Google News as new articles are published. Place such a sitemap either in the root directory or in the news section of the site.

The main elements that make up a Sitemap for news:

  • You need to specify the namespace for news sitemaps:
xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"

Mandatory tags:

  • <news:news> – parent tag for all news tags;
  • <news:publication> – the publication that published the article. It contains two obligatory child elements:
    • <news:name> – publication name;
    • <news:language> – language in ISO 639-1 format;
  • <news:publication_date> – the exact date in W3C format;
  • <news:title> – the title of the article, which should be the same as on the site.

An example of a sitemap for Google News:
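
A minimal sketch with a single article. The article URL, publication name, date and title are hypothetical placeholders.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch: the article URL, name, date and title are hypothetical. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://site.com/news/example-article.html</loc>
    <news:news>
      <news:publication>
        <news:name>Example News</news:name>
        <!-- Language in ISO 639-1 format -->
        <news:language>en</news:language>
      </news:publication>
      <!-- Publication date in W3C format -->
      <news:publication_date>2022-05-16T19:20:30+03:00</news:publication_date>
      <!-- Should match the article title shown on the site -->
      <news:title>Example article title</news:title>
    </news:news>
  </url>
</urlset>
```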

How to submit sitemap

There are several ways to submit a sitemap to Google:

  • Using a webmaster dashboard, such as Google Search Console;

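  • Send a ping request – a GET request to the address below, specifying the full URL of your XML-Sitemap. Note that Google announced in 2023 that this ping endpoint was being deprecated, so check the current documentation before relying on it:

https://www.google.com/ping?sitemap=FULL_URL_OF_SITEMAP

where FULL_URL_OF_SITEMAP is the full URL of the sitemap.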

Example:

https://www.google.com/ping?sitemap=https://site.com/sitemap1.xml

  • Place the sitemap address in robots.txt file and it will be detected the next time the site is crawled. Example:

Sitemap: https://site.com/sitemap1.xml

An XML-Sitemap is only parsed the first time it is detected, not every time the site is crawled. If you have made changes to the file, inform the search engine, for example by resubmitting the sitemap in Google Search Console or with a ping request.

XML-Sitemap errors

By following the instructions above when creating the sitemap, you can avoid basic errors. If you did make a mistake when creating the file, you can see it in Google Search Console, under "Sitemap files".

You can also check for errors in the sitemap with Netpeak Spider. To do this, select Tools – XML Sitemap Validator. 

Then insert a link to the appropriate sitemap and click "Start".

After scanning, the validator will list the errors present in the sitemap (1). After you press the "Into Table" button (2), the URLs of the pages will move from the validator to the program's main working area, where you can continue to work on them.

Lifehack

As noted above, a sitemap should not exceed 50 MB or contain more than 50,000 URLs. This is correct in terms of sitemap creation and Google's recommendations.

Some experts argue that sitemaps of this size are not always fully crawled and that the links in them are not indexed quickly.

In some cases, limiting a sitemap to 10,000 or even 1,000 URLs has given better results.

So if your site has problems with URL crawling and indexing, or if you need to get new pages (product cards, for example) into the index quickly, you can try breaking your sitemap into smaller parts and listing them in a sitemap index file.

Smaller URL lists are supposed to be easier for the search engine to crawl.

At the same time, sitemaps should not be split too finely, into tens of thousands of files: Google Search Console reports show information for only 1,000 sitemap URLs, so you may not be able to get data about all of your XML-Sitemaps from GSC.

Calculate the size of each sitemap based on the size of the site. Drawing on such cases, you can test splitting sitemap files by section, by number of URLs, or by how recent the content is.

Conclusions

  • An XML sitemap is needed to help search engine crawlers discover and index the relevant pages on a site. It contains the URLs of site pages, as well as related additional data, such as when they were last updated. It is very important to comply with the requirements for files of this type so that the search engine crawls and indexes the necessary pages of the site in time.
  • Separate sitemaps can be created for images and video, and a dedicated XML sitemap can be created for Google News.
  • Create a sitemap by hand only if your site is small, otherwise it can be very time-consuming.
  • Use CMS tools, sitemap generators and other software to create your sitemap, and periodically check your XML for correctness.
  • Sitemaps should be updated automatically on a regular basis, so that the search engine bot indexes the current versions of pages as soon as possible after an update and stops crawling pages whose crawl instructions or accessibility have changed.