Restricting Crawls to Domains and Subdomains

Crawl has the following default behavior:

  • If a seed URL contains a non-www subdomain (e.g., http://blog.diffbot.com or https://docs.diffbot.com), crawling will be limited to that subdomain.
  • If a seed URL lacks a subdomain or uses “www” (e.g., http://www.diffbot.com), crawling will extend to the entire domain.

If you enter a seed of http://blog.diffbot.com, only URLs from http://blog.diffbot.com will be crawled. If you enter a seed of http://www.diffbot.com, URLs from http://www.diffbot.com, http://blog.diffbot.com, https://docs.diffbot.com, and any other diffbot.com subdomains will be crawled.
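For illustration, here is a minimal Python sketch (using the requests library) that creates one crawl per default behavior above. The endpoint and the token, name, seeds, and apiUrl parameters follow Diffbot's v3 Crawl API as we understand it; the token value is a placeholder, so verify the details against the Crawl API reference.

```python
import requests

# Assumed Diffbot v3 Crawl API endpoint; YOUR_TOKEN is a placeholder.
CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"
TOKEN = "YOUR_TOKEN"

def create_crawl(name, seeds):
    """Create a crawl job; `seeds` is a list of seed URLs."""
    response = requests.post(CRAWL_ENDPOINT, data={
        "token": TOKEN,
        "name": name,
        # Multiple seeds passed as one space-delimited string (assumption).
        "seeds": " ".join(seeds),
        # Process each crawled page with the Analyze API.
        "apiUrl": "https://api.diffbot.com/v3/analyze",
    })
    response.raise_for_status()
    return response.json()

# Limited to the blog.diffbot.com subdomain by default:
create_crawl("blog-only", ["http://blog.diffbot.com"])

# Extends to the entire diffbot.com domain by default:
create_crawl("whole-domain", ["http://www.diffbot.com"])
```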

To make Diffbot visit other subdomains of the same domain as well, disable the “Restrict Subdomains” toggle.

Processing Pages From Other Domains

Crawl offers limited support for processing pages on other domains.

If you need to process pages on other domains or subdomains (e.g., a blog home page presents all of its links as shortened URLs), you can do so by disabling the “Restrict Domain” option in the Crawl Dashboard UI (or the restrictDomain parameter in the Crawl API).

Doing so will enable Crawl to spider all links regardless of domain, up to one “hop” from your seed URLs. (A “hop” is one link-depth from your seed.)
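A minimal sketch of the API route follows, assuming the same endpoint and parameters as above. The restrictDomain name comes from this section; the value 0 meaning “disabled” is an assumption to verify against the API reference, and the seed URL is hypothetical.

```python
import requests

# Disable domain restriction when creating the crawl. restrictDomain is
# named in the Crawl docs; the value 0 meaning "off" is an assumption.
resp = requests.post("https://api.diffbot.com/v3/crawl", data={
    "token": "YOUR_TOKEN",               # placeholder
    "name": "shortened-links",
    "seeds": "http://blog.example.com",  # hypothetical seed
    "apiUrl": "https://api.diffbot.com/v3/analyze",
    "restrictDomain": 0,
})
resp.raise_for_status()
```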

To prevent over-spidering, Crawl will not exhaustively spider multiple domains from a limited set of seed URLs. If you wish to include multiple domains in your crawl, provide a seed URL for each domain.
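Under the same assumptions as the sketches above, crawling several domains is a matter of passing one seed per domain, here as a space-delimited seeds value (the delimiter is an assumption; check the API reference):

```python
import requests

# One seed per domain; Crawl spiders each seeded domain exhaustively.
seeds = [
    "http://www.example.com",  # hypothetical first domain
    "http://www.example.org",  # hypothetical second domain
]

resp = requests.post("https://api.diffbot.com/v3/crawl", data={
    "token": "YOUR_TOKEN",     # placeholder
    "name": "multi-domain",
    "seeds": " ".join(seeds),  # space-delimited, as assumed above
    "apiUrl": "https://api.diffbot.com/v3/analyze",
})
resp.raise_for_status()
```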