How does Diffbot handle duplicate pages/content while crawling?

Crawl will often encounter duplicate pages (with different URLs) while spidering a site. There are a handful of ways Diffbot helps you handle these duplicates:

Pages with duplicate HTML sources will be ignored while crawling

While crawling (spidering for links), and before sending a URL to be processed, Crawl examines the raw HTML source of each page and compares it to the source HTML of all previously-spidered pages. Any exact matches to previously-seen pages will be flagged as duplicates and ignored.

The duplicate comparison is made on the raw HTML source only; Javascript is executed only when a page is actually processed.
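Conceptually, this is exact-match deduplication of the fetched source. The sketch below illustrates the idea in Python using a content hash; the hashing approach and helper names are our own assumptions for illustration, not Diffbot's internal implementation.

```python
import hashlib
import urllib.request

seen_hashes = {}  # content hash -> first URL spidered with that exact source

def is_duplicate(url):
    """Fetch the raw HTML (no Javascript execution) and check whether the
    exact same source was seen on a previously-spidered page."""
    raw_html = urllib.request.urlopen(url).read()
    digest = hashlib.sha256(raw_html).hexdigest()
    if digest in seen_hashes:
        return True, seen_hashes[digest]  # duplicate of this earlier URL
    seen_hashes[digest] = url
    return False, None
```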

Duplicate URLs are noted in the URL Report

The URL Report, available from each crawl's status page or via the Crawl API, notes each duplicate URL and the document ID (docId) of the page it duplicates.
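If you download the URL report as a CSV, a few lines of Python can pull out the flagged duplicates. The file name and column names below are assumptions for illustration; check them against the header row of your own report.

```python
import csv

URL_COLUMN = "url"                # hypothetical column names; verify against
DUPLICATE_COLUMN = "duplicateOf"  # the actual header row of your report

def duplicates_from_report(path):
    """Yield (url, docId) pairs for every URL flagged as a duplicate."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            doc_id = (row.get(DUPLICATE_COLUMN) or "").strip()
            if doc_id:  # a non-empty value means this URL duplicates another page
                yield row[URL_COLUMN], doc_id

for url, doc_id in duplicates_from_report("my_crawl_urls.csv"):
    print(f"{url} duplicates docId {doc_id}")
```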

Note: If your crawl takes advantage of the Analyze API's ability to execute Javascript in order to find Ajax-delivered links, Crawl's duplicate detection will be disabled. This is because Ajax-powered sites can serve identical HTML source for multiple pages even though the actual on-page content, once Javascript is fully executed, is quite different.

Pages with a different canonical link definition will be ignored

Two things will happen when a page contains a canonical link element that points to a URL other than the page's own:

  1. The current page will be skipped/ignored as a duplicate.
  2. The canonical URL will be automatically added to the Crawl queue (if it is not already in the queue).
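A rough sketch of that canonical-link logic in Python follows; the parsing approach, queue handling, and function names are illustrative assumptions rather than Diffbot's actual code.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class CanonicalFinder(HTMLParser):
    """Extract the href of a <link rel="canonical"> element, if any."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical" and a.get("href"):
            self.canonical = a["href"]

def handle_page(url, raw_html, queue, queued):
    """Skip a page whose canonical link points elsewhere, queueing the
    canonical URL if it has not been queued already."""
    finder = CanonicalFinder()
    finder.feed(raw_html)
    if finder.canonical:
        canonical = urljoin(url, finder.canonical)  # resolve relative hrefs
        if canonical != url:
            if canonical not in queued:   # 2. add the canonical URL to the queue
                queue.append(canonical)
                queued.add(canonical)
            return None                   # 1. skip/ignore this page as a duplicate
    return raw_html                       # no conflicting canonical: process normally

queue, queued = deque(), set()
page = '<html><head><link rel="canonical" href="/page"></head></html>'
handle_page("https://example.com/page?utm=x", page, queue, queued)
print(list(queue))  # ['https://example.com/page']
```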