Robots.txt

Definition

A robots.txt file is a text file placed in the root directory of a website that gives instructions to search engine crawlers about which pages or sections should be crawled or excluded. It is part of the Robots Exclusion Protocol and is one of the first files crawlers check before scanning a site.

This file helps website owners manage how search engines interact with their content. For example, a company may use robots.txt to block crawlers from indexing admin pages, staging environments, or duplicate content, while still allowing public-facing content to be discovered and ranked.
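
As a minimal illustration, a file covering that scenario might look like the snippet below; the directory names are placeholders rather than paths from any real site.

    # Applies to all crawlers
    User-agent: *
    # Keep private and work-in-progress areas out of the crawl
    Disallow: /admin/
    Disallow: /staging/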

Advanced

Robots.txt works by defining rules for user-agents (search engine bots). Directives such as Disallow, Allow, Crawl-delay, and Sitemap tell crawlers what they may or may not access, though not every crawler honours every directive (Googlebot, for example, ignores Crawl-delay). The filename must be lowercase and the file must sit at the root of the host, e.g. rubixstudios.com.au/robots.txt, to be recognized; the paths in its rules are matched case-sensitively.
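
For illustration, a file using each of those directives might look like this; the paths, the delay value, and the sitemap URL are examples only.

    User-agent: *
    Disallow: /private/             # nothing under /private/ may be crawled
    Allow: /private/press-kit/      # except this subfolder
    Crawl-delay: 10                 # wait 10 seconds between requests (ignored by Googlebot)
    Sitemap: https://rubixstudios.com.au/sitemap.xml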

Advanced usage involves creating separate rules for different bots, such as Googlebot, Bingbot, or ad crawlers. While robots.txt can restrict crawling, it cannot guarantee content won’t appear in search results; blocked URLs may still show if they are linked externally. For complete control, it is often combined with meta robots tags or noindex directives, bearing in mind that a crawler can only see a noindex directive on a page it is allowed to crawl. Monitoring crawl logs and Search Console reports ensures the rules are working as intended.
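
A rough sketch of per-bot rules is shown below; the user-agent tokens are real crawler names, but the paths and the delay value are placeholders.

    # Google's main crawler: block only internal search result pages
    User-agent: Googlebot
    Disallow: /search/

    # Bing's crawler: same block, plus a politeness delay (Bing honours Crawl-delay)
    User-agent: Bingbot
    Disallow: /search/
    Crawl-delay: 5

    # Every other bot falls back to this group
    User-agent: *
    Disallow: /search/
    Disallow: /admin/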

Why it matters

  • Controls how search engines interact with site content.
  • Conserves crawl budget by preventing bots from wasting resources on low-value pages.
  • Protects sensitive or irrelevant areas of a site from being indexed.
  • Provides signals to different search engine crawlers for optimization.

Use cases

  • Blocking duplicate pages such as print-friendly versions.
  • Restricting access to admin or login directories.
  • Managing crawl behavior for large e-commerce sites with thousands of URLs.
  • Including sitemap location to help crawlers discover content (all four use cases are combined in the sketch after this list).
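
A sketch combining these use cases in a single file; every path and the sitemap URL are hypothetical, and the wildcard patterns rely on support from major crawlers such as Googlebot and Bingbot.

    User-agent: *
    Disallow: /print/              # print-friendly duplicates
    Disallow: /wp-admin/           # admin area
    Disallow: /login/              # login pages
    Disallow: /*?sort=             # parameterised sort variants of category pages
    Disallow: /*?sessionid=        # session-ID URLs that waste crawl budget
    Sitemap: https://www.example.com/sitemap.xml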

Metrics

  • Crawl stats in Google Search Console.
  • Number of blocked vs. allowed requests in server logs.
  • Changes in crawl budget efficiency.
  • Index coverage reports to confirm only intended pages appear in SERPs.

Issues

  • Misconfigured robots.txt rules blocking important content from being crawled and indexed.
  • Assuming robots.txt hides sensitive data (it does not guarantee privacy).
  • Differences in crawler behavior; some bots ignore the rules entirely.
  • Overly broad disallow rules harming organic visibility (see the sketch after this list).
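
As an illustration of how a rule can match more than intended (the path is hypothetical):

    User-agent: *
    # Intended to block only /product-feeds/, but rules match URL prefixes,
    # so this also blocks /products/, /product-news/ and every product page.
    Disallow: /product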

Example

An online store mistakenly blocks /products/ in its robots.txt file, preventing all product pages from being crawled and indexed. After correcting the directive to allow Googlebot access, the pages begin appearing in search results again, restoring lost traffic.
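
A before-and-after sketch of that fix; the /products/ path mirrors the scenario above, while the other path is hypothetical.

    # Before: every product URL is blocked from crawling
    User-agent: *
    Disallow: /products/
    Disallow: /cart/

    # After: the /products/ rule is removed so product pages can be crawled
    # and indexed again, while low-value pages stay blocked
    User-agent: *
    Disallow: /cart/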