What Is a robots.txt File?
A robots.txt file is a simple text file placed on a website's server. It tells web robots (spiders, crawlers, or bots) which pages or files they can and cannot request from a website. In this way, it guides the behavior of search engine spiders and other robots that crawl websites.
Web admins use robots.txt files to keep search engines from indexing certain pages or sections of their websites. They can also use this file to discourage robots from accessing sensitive or confidential areas, such as login pages or admin panels.
- What Is the Purpose of the Robots File?
- Common Uses of robots.txt
- How to Use robots.txt?
- Must-Knows About robots.txt
- How to Check if You Have a robots.txt File
- Google and the Bing Network
- Summary
What Is the Purpose of the Robots File?
When a search engine crawls (visits) your website, the first thing it looks for is your robots.txt file. This file tells search engines what they should and should not index (save and make available as search results to the public). It may also indicate the location of your XML sitemap. The search engine then sends its "bot," "robot," or "spider" to crawl your site as directed in the robots.txt file (or skips your site if you asked it not to visit).
Google's bot is called Googlebot, and Microsoft Bing's bot is called Bingbot. Many other search engines, such as Excite, Lycos, Alexa, and Ask Jeeves, have their own bots as well. Most bots come from search engines, although other sites sometimes send out bots for various reasons. For example, a site may ask you to put code on your website to verify that you own it, and then send a bot to check that the code is there.
Keep in mind that robots.txt works like a "No Trespassing" sign. It tells robots whether you want them to crawl your site or not. It doesn't block access. Honorable and legitimate bots will honor your directive on whether they can visit. Rogue bots may ignore the robots.txt file.
Please see Google's official stance on the robots.txt file for more information.
Common Uses of robots.txt
The robots.txt file is a plain text file in a website's root directory that's used primarily to communicate with web crawlers and other web robots. It provides instructions about which areas of a website should not be processed or scanned by these bots. Take a look at the common uses of the robots.txt file below, followed by a sample file that combines several of them:
- Controlling crawler access. It tells search engine bots which pages or sections of the site shouldn't be crawled. This can prevent search engines from indexing certain pages, such as admin pages, private sections, or duplicate content, that you don't want to appear in search engine results.
- Preventing resource overload. By limiting crawler access to heavy resource pages, robots.txt can help prevent web server overload. This is useful for sites with limited server resources that can't handle heavy bot traffic along with regular user traffic.
- Securing sensitive information. Although not a foolproof security measure, robots.txt can request bots to avoid indexing sensitive directories or files. It's important to note that this shouldn't be the sole method of protecting sensitive information since not all bots follow the instructions in robots.txt.
- Managing crawl budget. For large websites, robots.txt can help manage the crawl budget by directing search engine bots away from low-priority pages. This ensures that important pages are crawled and indexed more efficiently, which improves your site's visibility in search engine results.
- Specifying sitemap locations. You can use robots.txt to specify the location of your XML sitemap(s). This makes it easier for search engines to discover and index your site's pages, which can enhance your SEO efforts.
- Blocking unwanted bots. While reputable search engine bots follow the directives in robots.txt, the file can also be used to block known unwanted bots, such as scrapers or malicious bots, from accessing the site. However, since compliance is voluntary, not all bots will honor these requests.
- Experimentation and testing. Developers and SEO professionals might use robots.txt to temporarily block search engines from indexing under-construction areas of a site or new features until they're ready for public viewing and indexing.
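For illustration, a robots.txt file combining several of these uses might look like the following sketch. The bot name, folder paths, and sitemap URL are hypothetical placeholders, not values from any real site:

# Block a known scraper entirely (hypothetical bot name)
User-agent: BadScraperBot
Disallow: /

# Keep all other bots out of private areas (hypothetical paths)
User-agent: *
Disallow: /admin/
Disallow: /private/

# Advertise the sitemap location (placeholder URL)
Sitemap: https://www.example.com/sitemap.xml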
How to Use robots.txt?
To use a robots.txt file, follow these basic steps:
- Create a plain text file named robots.txt with a text editor or Notepad.
- Enter the instructions for the web robots in the file (see the minimal example after this list).
- Save the file as robots.txt.
- Upload the file to the root directory of your website using an FTP client or cPanel file manager.
- Test the file using the robots.txt Tester tool in Google Search Console to ensure it works properly.
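For example, the instructions entered in step 2 can be as simple as a single rule keeping all bots out of one folder. The folder name here is just an illustrative placeholder:

# Ask all bots to skip the (hypothetical) /drafts/ folder
User-agent: *
Disallow: /drafts/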
There are several instructions or directives that you can include in the robots.txt file, such as User-agent, Disallow, Allow, Crawl-delay, and Sitemap. Each is described below, with a sample file after the list:
- User-agent Directive: This directive specifies the robot to which the instruction applies.
- Disallow Directive: This directive tells the robot not to crawl certain pages or directories.
- Allow Directive: This directive tells the robot which pages or directories it may crawl, even within an otherwise disallowed path.
- Crawl-delay: Indicates how many seconds a crawler should wait before loading and crawling page content. Note that the Googlebot doesn't acknowledge this command, but the crawl rate can be set in Google Search Console.
- Sitemap: Used to specify the location of any XML sitemap(s) associated with this URL.
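Putting these together, a file using all five directives might look like the sketch below. The paths, delay value, and sitemap URL are assumed placeholders:

# Crawl-delay is non-standard; Googlebot ignores it
# (set its crawl rate in Google Search Console instead)
User-agent: *
Crawl-delay: 10
Disallow: /tmp/
Allow: /tmp/public/

Sitemap: https://www.example.com/sitemap.xml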
It's important to note that while the robots.txt file can help control which pages web robots crawl, it doesn't guarantee that those pages won't appear in search results. Some bots may ignore the robots.txt file entirely, and even compliant search engines can index a blocked page if other sites link to it.
Where Does robots.txt Go?
The robots.txt file belongs in your document root folder. You can create a blank file and name it robots.txt. An empty file will prevent 404 errors when crawlers request it and allow all search engines to crawl and rank anything they want.
Allow All Web Crawlers Access to All Content
To allow all web crawlers full access to your site, you can use:
User-agent: *
Disallow:
This configuration specifies that all bots are allowed to crawl the entire website because the Disallow directive is empty.
Blocking Robots and Search Engines from Crawling
If you want to stop bots from visiting your site and stop search engines from ranking you, use this code:
#Code to not allow any search engines!
User-agent: *
Disallow: /
You can also prevent robots from crawling parts of your site while allowing them to crawl other sections. The following example would request search engines and robots not to crawl the cgi-bin folder, the tmp folder, the junk folder, and everything in those folders on your website.
# Blocks robots from specific folders / directories
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /junk/
In the above example, http://www.yoursitesdomain.com/junk/index.html would be one of the blocked URLs, while http://www.yoursitesdomain.com/index.html and http://www.yoursitesdomain.com/someotherfolder/ would remain crawlable.
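Googlebot and Bingbot also understand simple wildcard patterns, although these aren't part of the original robots.txt standard. For example, this sketch asks bots that support wildcards not to crawl any URL ending in .pdf:

# Block URLs ending in .pdf (for bots that support the * and $ wildcards)
User-agent: *
Disallow: /*.pdf$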
Block a Specific Web Crawler from a Specific Web Page
This configuration tells only Google's crawler (Googlebot) not to crawl any URLs under www.example.com/example-subfolder/.
User-agent: Googlebot
Disallow: /example-subfolder/
Note that robots.txt works like a No Trespassing sign. It tells robots whether you want them to crawl your site or not. It doesn't block access. Honorable and legitimate bots will honor your directive on whether they can visit. Rogue bots may ignore robots.txt. As explained below, you MUST utilize the webmaster tools for Bingbot and Googlebot to control their crawl rates, since they do not respect the Crawl-delay directive in the robots.txt file.
Must-Knows About robots.txt
When dealing with robots.txt files, there are several crucial points you need to understand to use robots.txt effectively and avoid common pitfalls. Here's what you should know about robots.txt:
- Proper placement. The robots.txt file must reside in the root directory (e.g., https://www.example.com/robots.txt) for crawlers to find and obey its directives.
- Voluntary adherence. Not all bots respect robots.txt directives, especially malicious ones. It's a protocol based on cooperation, not enforcement.
- Not for security. robots.txt is publicly visible and should not be used to protect sensitive data. Use authentication methods to secure private information on your site.
- Syntax precision. Errors in syntax can lead to unintended crawling behavior, which makes it critical to follow the correct format and understand that directives are case-sensitive.
- Selective access. You can specify which bots are allowed or disallowed from accessing parts of your site, as in the example after this list. This gives you detailed control over bot traffic.
- Noindex for removal. To remove already indexed content, robots.txt isn't effective. Use meta tags with noindex or specific tools provided by search engines for content removal.
- Regular reviews. Your robots.txt file should be checked regularly to ensure it aligns with your site's evolving structure and content strategy. This way, you can ensure that it reflects your current preferences accurately for search engine crawling.
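As an example of selective access, the following sketch (with a hypothetical /beta/ path) lets Googlebot crawl everything while asking all other bots to skip that folder:

# Googlebot may crawl everything (empty Disallow)
User-agent: Googlebot
Disallow:

# All other bots should skip the hypothetical /beta/ folder
User-agent: *
Disallow: /beta/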
How to Check if You Have a robots.txt File
Simply add /robots.txt at the end of your website's root domain. For example, if your website's URL is https://www.example.com, the URL to check would be https://www.example.com/robots.txt.
If no robots.txt file appears (for example, you see a 404 error), then you don't currently have a live robots.txt file.
Google and the Bing Network
You can create Google and Bing Network Webmaster accounts and configure your domains for a lower crawl delay. Read Google's official stance on the robots.txt file. You MUST utilize Google's Webmaster tools to set most of the crawl parameters for Googlebot.
Important note: Googlebot and Bingbot do NOT honor the Crawl-delay directive in a standard robots.txt file, so limiting the crawl rates of these bots must be done directly with Google/Bing.
We do still recommend configuring a robots.txt file. This will reduce the rate at which crawlers initiate requests with your site and reduce the resources required from the system, allowing for more legitimate traffic to be served.
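For crawlers that do support the non-standard Crawl-delay directive, a rule like this sketch (the 10-second value is only an example) asks them to pause between requests:

# Ask cooperating crawlers to wait 10 seconds between requests
# (Googlebot and Bingbot ignore this; use their webmaster tools instead)
User-agent: *
Crawl-delay: 10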
If you would like to reduce traffic from crawlers such as Yandex or Baidu, this typically needs to be done using a .htaccess block.
For more details regarding these topics, please reference the sections below:
Ask Google to Recrawl Your URLs
If you've made changes or updates to a page on your website, you can request that Google re-index the page to ensure that the latest version is reflected in their search results. However, you can only request indexing for URLs that you own or manage. This means that if the URL belongs to another website, or if it's a page on your site that you don't have access to, you won't be able to request re-indexing for that URL.
Crawling can take anywhere from a few days to a few weeks. Be patient and monitor progress using the Index Status report or the URL Inspection tool.
Note: Requesting a crawl doesn't guarantee that inclusion in search results will happen instantly. Google's systems prioritize the fast inclusion of high-quality, useful content.
Use the URL Inspection Tool (Just a Few URLs)
To use the URL Inspection tool, you'll need to be an owner or full user of the Search Console property for the website. Keep in mind that there is a quota for submitting individual URLs, so you can only request indexing for a limited number of pages at a time. Additionally, requesting multiple recrawls for the same URL won't make the crawl happen any faster.
Submit a Sitemap (Many URLs at Once)
If you have many URLs on your website, it's a good idea to submit a sitemap to Google. A sitemap is a file that lists all the URLs on your site that you want Google to index. It can be particularly helpful if you've just launched a new site, recently performed a site move, or have pages that are difficult for Google to discover through regular crawling. A sitemap can also provide additional metadata about your site, such as information about alternate language versions, videos, images, or news-specific pages. This helps Google better understand the structure and content of your site, which can lead to more accurate and relevant search results for users. Learn how to create and submit a sitemap.
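As noted earlier, you can also advertise your sitemap's location in the robots.txt file itself. The URL below is a placeholder:

# Point crawlers at your XML sitemap (placeholder URL)
Sitemap: https://www.example.com/sitemap.xml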
Summary
Web administrators use robots.txt files to prevent search engines from indexing specific pages or sections of their websites. In addition, this file is used to discourage robots from accessing sensitive or confidential areas, such as login pages or admin panels. Using a robots.txt file lets website owners control which pages are indexed by search engines and which are not.
However, it's important to note that robots.txt isn't a secure method of preventing access to sensitive information. Because robots.txt is a publicly accessible file, malicious actors can easily find and read it. As a result, it shouldn't be used to protect confidential information or prevent unauthorized access to restricted website areas.
Overall, the robots.txt file is an important tool that helps website owners manage how search engines and other robots interact with their websites. Using this file, they can ensure that their website is crawled and indexed the way they want, while remembering that genuinely sensitive information requires stronger protection than robots.txt alone.
If you need further assistance, feel free to contact us via Chat or Phone:
- Chat Support - While on our website, you should see a CHAT bubble in the bottom right-hand corner of the page. Click anywhere on the bubble to begin a chat session.
- Phone Support -
- US: 888-401-4678
- International: +1 801-765-9400
You may also refer to our Knowledge Base articles to help answer common questions and guide you through various setup, configuration, and troubleshooting steps.