Can Diffbot access content within an intranet or requiring a login?

(Quick answer: Yes!)

Diffbot APIs are commonly used to extract content from public web pages. There are a few ways to handle pages that are not publicly accessible, or that provide additional information to logged-in users:

Authenticate Using Custom Headers (Cookies)

The most consistently applicable way to “log in” to a site is to provide a cookie value corresponding to a logged-in user. (You can set custom headers in individual API requests, while crawling, or while performing Bulk Processing jobs. See how to set custom headers.)

To capture a cookie value in the first place:

  1. Log in to the site using your regular browser.
  2. Open your browser developer tools’ “Network” panel. If it’s empty, you may need to refresh your page to see the network requests made.
  3. Select the primary page request and then find the “cookie” entry in your request headers. This will likely be a long string. Select and copy the entire string.
  4. Optional: much of the data in a cookie string will not be necessary, so you can reduce some of the content here if you can easily determine which subset of the data is required for logging in.

You can now use this cookie value in your individual requests or within your Crawlbot crawls or Bulk Processing jobs.
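
As a rough illustration of steps 3 and 4, the Python sketch below trims a copied cookie string down to a chosen subset and attaches it to an individual API request as a custom header. The endpoint (https://api.diffbot.com/v3/article), the X-Forward-Cookie header name, the helper names, and the cookie names are all assumptions made for this example, so confirm them against the custom headers documentation before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumptions for illustration only: verify the endpoint and the header name
# against the API and custom headers documentation.
API_ENDPOINT = "https://api.diffbot.com/v3/article"
COOKIE_HEADER = "X-Forward-Cookie"


def trim_cookie(cookie_string, keep):
    """Keep only the named cookies from a copied "cookie" header value (step 4)."""
    pairs = []
    for part in cookie_string.split(";"):
        # Split on the first "=" only, since cookie values may contain "=".
        name, sep, value = part.strip().partition("=")
        if sep and name in keep:
            pairs.append(f"{name}={value}")
    return "; ".join(pairs)


def build_request(token, page_url, cookie_string):
    """Build (but do not send) an API request carrying the cookie as a custom header."""
    query = urlencode({"token": token, "url": page_url})
    return Request(f"{API_ENDPOINT}?{query}", headers={COOKIE_HEADER: cookie_string})


if __name__ == "__main__":
    # Placeholder cookie string; a real one is copied from the Network panel.
    copied = "sessionid=abc123; theme=dark; csrftoken=xyz789"
    cookie = trim_cookie(copied, keep={"sessionid", "csrftoken"})
    request = build_request("YOUR_TOKEN", "https://example.com/members/report", cookie)
    print(request.full_url)
    print(request.header_items())
```

Passing the resulting request to urllib.request.urlopen would send it. If you are unsure which cookies the site needs, skip the trimming step and send the full string.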

POST the Content Directly

If you have access to your target content (e.g., you are processing pages within a corporate intranet, or you have an offline archive of markup or text), you can POST HTML directly to any of our API endpoints, automatic or custom. Diffbot will process your content as it would a directly-accessible web page.
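
A minimal sketch of such a POST in Python follows. It assumes the https://api.diffbot.com/v3/article endpoint, a text/html Content-Type, and that the url parameter is still supplied to identify the content; the function name and the sample intranet address are hypothetical, and the exact requirements for each API may differ.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint, for illustration only.
API_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_post_request(token, page_url, html):
    """Build (but do not send) a POST request whose body is the markup to process."""
    # page_url is assumed to act as an identifier for the posted content.
    query = urlencode({"token": token, "url": page_url})
    return Request(
        f"{API_ENDPOINT}?{query}",
        data=html.encode("utf-8"),
        headers={"Content-Type": "text/html; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    markup = "<html><body><h1>Quarterly report</h1><p>Internal only.</p></body></html>"
    request = build_post_request("YOUR_TOKEN", "http://intranet.example/reports/q1", markup)
    print(request.get_method(), request.full_url)
```

Because the markup travels in the request body, the page itself never has to be reachable from outside your network.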

See the documentation for the specific API you are calling for more details on how to craft your POST.

Basic Authentication

To access pages that require basic access authentication, include the username and password in the URL you pass as your API request’s url parameter, using the standard username:password@host form.

A full request example:
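
The Python sketch below is one way to assemble such a request URL. It assumes the https://api.diffbot.com/v3/article endpoint and the standard username:password@host URL syntax. The helper names, token, host, and credentials are placeholders invented for the example.

```python
from urllib.parse import quote, urlencode, urlsplit, urlunsplit

# Assumed endpoint, for illustration only.
API_ENDPOINT = "https://api.diffbot.com/v3/article"


def add_credentials(page_url, username, password):
    """Return page_url with username:password@ placed before the host.

    Assumes page_url does not already contain credentials.
    """
    parts = urlsplit(page_url)
    # Percent-encode reserved characters (such as "@" or ":") in the credentials.
    userinfo = f"{quote(username, safe='')}:{quote(password, safe='')}"
    return urlunsplit(parts._replace(netloc=f"{userinfo}@{parts.netloc}"))


def build_api_url(token, page_url, username, password):
    """Build the full API request URL, encoding the target URL as a query value."""
    target = add_credentials(page_url, username, password)
    return f"{API_ENDPOINT}?{urlencode({'token': token, 'url': target})}"


if __name__ == "__main__":
    print(build_api_url("YOUR_TOKEN", "http://www.example.com/private", "user", "secret"))
```

The target URL is percent-encoded as a whole when it is placed in the url parameter, so the colon and the @ sign in the credentials do not clash with the API request's own query string.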