Will Crawlbot spider across domains or subdomains?

Crawlbot has the following default behavior:

  • If a seed URL contains a non-www subdomain (http://blog.diffbot.com or http://support.diffbot.com), crawling will be limited to the specified subdomain.
  • If a seed URL lacks a subdomain or uses “www” (http://www.diffbot.com), crawling will extend to the entire domain.

If you enter a seed of http://blog.diffbot.com, only URLs from http://blog.diffbot.com will be crawled. If you enter a seed of http://www.diffbot.com, URLs from http://www.diffbot.com, http://blog.diffbot.com, http://support.diffbot.com, etc. will be crawled.
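The default rule above can be sketched as a small scope check. This is only an illustration of the behavior described, not Crawlbot's actual implementation: the function names are hypothetical, and the sketch assumes the registrable domain is simply the last two labels of the hostname (which would be wrong for suffixes such as co.uk).

```python
from urllib.parse import urlsplit


def _host(url: str) -> str:
    """Return the lowercased hostname of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def _base_domain(host: str) -> str:
    # Assumption: the registrable domain is the last two labels
    # (diffbot.com). Real-world suffixes like co.uk need a suffix list.
    return ".".join(host.split(".")[-2:])


def in_default_scope(seed_url: str, candidate_url: str) -> bool:
    """Hypothetical helper: would candidate_url be crawled for this seed?"""
    seed_host = _host(seed_url)
    cand_host = _host(candidate_url)
    if not seed_host or not cand_host:
        return False

    base = _base_domain(seed_host)
    if seed_host in (base, "www." + base):
        # Seed has no subdomain or uses "www": the entire domain is in scope.
        return cand_host == base or cand_host.endswith("." + base)

    # Seed has a non-www subdomain: only that exact subdomain is in scope.
    return cand_host == seed_host
```

Under these assumptions, a seed of http://blog.diffbot.com keeps http://support.diffbot.com out of scope, while a seed of http://www.diffbot.com brings both subdomains in.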

Processing Pages From Other Domains

Crawlbot offers limited support for processing pages on other domains.

If you need to process pages on other domains or subdomains (e.g., a blog home page presents all its links as shortened URLs), you may do so by disabling the “Restrict Domain” option in the Crawlbot UI (or via the restrictDomain parameter in the Crawlbot API). Doing so allows Crawlbot to spider all links regardless of domain, but only up to one “hop” from your seed URLs. (A “hop” is one link-depth from your seed. Read more on hops.)
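To make the one-hop limit concrete, here is a minimal sketch that walks an in-memory link graph and stops at a given link-depth. It is an illustration only and does not reflect how Crawlbot is implemented: the function name, the max_hops argument, and the example URLs are all hypothetical.

```python
from collections import deque


def urls_within_hops(seeds, links, max_hops=1):
    """Return {url: hop_count} for every URL reachable within max_hops.

    `links` maps a URL to the list of URLs it links to (a stand-in for
    fetching and parsing pages). Domains are deliberately ignored, which
    mirrors a crawl with domain restriction disabled.
    """
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if hops[url] >= max_hops:
            continue  # Already at the hop limit: do not follow its links.
        for target in links.get(url, []):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


# Hypothetical link graph: a blog home page that links out via a shortener.
example_links = {
    "http://blog.example.com/": [
        "http://short.example/a",
        "http://short.example/b",
    ],
    "http://short.example/a": ["http://other.example.org/deep"],
}
reached = urls_within_hops(["http://blog.example.com/"], example_links)
```

In this sketch the two shortened links are reached at one hop, while the page they link to sits two hops from the seed and is never visited.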

To prevent over-spidering, Crawlbot cannot exhaustively spider multiple domains from a limited set of seeds. If you wish to include multiple domains in your crawl, please provide a seed URL for each domain.