Getting Started with Crawl

Spider a site for links and process them with Extract API

Crawl works hand-in-hand with Extract API (either automatic or custom). It quickly spiders a site for appropriate links and hands these links to an Extract API for processing. All structured page results are then compiled into a single "collection," which can be downloaded in full or searched using the Search API.

Note: If you have a complete list of all the URLs you wish to extract, you might be looking for Bulk Extract instead.

For documentation on how to use Crawl via API, check out Introduction to Crawl API.
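To give a sense of that flow, here is a minimal sketch in Python that downloads a finished crawl's compiled collection and then queries it with the Search API. It assumes the v3 endpoints described in the Crawl API docs; `DIFFBOT_TOKEN`, the job name `my-crawl`, and the example query are placeholders.

```python
import requests

TOKEN = "DIFFBOT_TOKEN"  # placeholder: your Diffbot token
NAME = "my-crawl"        # placeholder: the crawl job's name

# Download the full collection of structured results as JSON.
data = requests.get(
    "https://api.diffbot.com/v3/crawl/data",
    params={"token": TOKEN, "name": NAME, "format": "json"},
).json()
print(len(data), "extracted objects")

# Alternatively, query within the collection using the Search API,
# scoping the search to this job's collection via `col`.
hits = requests.get(
    "https://api.diffbot.com/v3/search",
    params={"token": TOKEN, "col": NAME, "query": "type:article"},
).json()
```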

🚧

Access to Crawl API is Limited to Plus Plans and Up

Upgrade to a Plus plan anytime at diffbot.com/pricing, or contact [email protected] for more information.

How to use Crawl in the Dashboard

A crawl job requires just two inputs (apart from authentication and a name):

  1. A seed URL
  2. A choice of Extract API to process URLs

A crawl job given a seed URL of https://www.diffbot.com and the Analyze API will spider every URL under the www.diffbot.com domain and process each of them with the Analyze API.

The result is effectively a list of every page on diffbot.com, each page's type classification, and the data extracted in the schema of that page type.
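The same job can be created programmatically. Below is a minimal sketch, assuming the v3 Crawl API's create parameters (`token`, `name`, `seeds`, `apiUrl`); the token and job name are placeholders.

```python
import requests

resp = requests.post(
    "https://api.diffbot.com/v3/crawl",
    data={
        "token": "DIFFBOT_TOKEN",            # placeholder token
        "name": "diffbot-site-crawl",        # placeholder job name
        "seeds": "https://www.diffbot.com",  # input 1: the seed URL
        # Input 2: the Extract API that processes each discovered URL.
        # Here, the Analyze API classifies every page automatically.
        "apiUrl": "https://api.diffbot.com/v3/analyze?mode=auto",
    },
)
print(resp.json())
```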

Crawl jobs may also include additional filtering logic at each step of this process to optimize for speed and reduce noise from the output data.

For example, a crawl job can be set up to extract only the products in a single category of an e-commerce website.
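As a concrete illustration of that kind of filtering, the sketch below narrows a crawl to one category using the Crawl API's `urlProcessPattern` parameter; the storefront URL and the `/category/shoes` path are invented for illustration.

```python
import requests

resp = requests.post(
    "https://api.diffbot.com/v3/crawl",
    data={
        "token": "DIFFBOT_TOKEN",             # placeholder token
        "name": "shoes-only-crawl",           # placeholder job name
        "seeds": "https://shop.example.com",  # hypothetical storefront
        # Process only URLs containing this substring, limiting
        # extraction to one (hypothetical) product category...
        "urlProcessPattern": "/category/shoes",
        # ...and extract matching pages with the Product API.
        "apiUrl": "https://api.diffbot.com/v3/product",
    },
)
print(resp.json())
```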

With a Custom API, crawl jobs can also be built to perform more advanced tasks, such as checking for the presence of a privacy policy or of certain technology scripts.
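A minimal sketch of pointing a crawl at a Custom API follows; the rule set name `checkPrivacyPolicy` is hypothetical and would need to be defined in your account first, and the seed URL is a placeholder.

```python
import requests

resp = requests.post(
    "https://api.diffbot.com/v3/crawl",
    data={
        "token": "DIFFBOT_TOKEN",      # placeholder token
        "name": "policy-audit-crawl",  # placeholder job name
        "seeds": "https://www.example.com",
        # apiUrl may point at a Custom API instead of a standard
        # Extract API; "checkPrivacyPolicy" is a hypothetical rule
        # set you would first define in your account.
        "apiUrl": "https://api.diffbot.com/v3/custom/checkPrivacyPolicy",
    },
)
print(resp.json())
```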

Crawl Limits

  • Plus plans may have up to 25 active crawls at a time.
  • Enterprise plans may run 100+ active crawls simultaneously.
  • All plans have a limit of 1,000 crawls per token.