Crawlbot will often encounter duplicate pages (with different URLs) while canvassing a site. There are two ways Diffbot helps you handle these duplicates:
Pages with duplicate HTML sources will be ignored while crawling
While crawling (spidering for links), and before sending a URL to be processed, Crawlbot examines the raw HTML source of each page and compares it to the source HTML of all previously-spidered pages. Any exact matches to previously-seen pages will be flagged as duplicates and ignored.
The Crawlbot URL Report — available from each crawl’s status page, or via the Crawlbot API — will note each duplicate URL, and the document ID (docId) of the page it duplicates.
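The exact-source comparison described above can be sketched as follows. This is a hypothetical illustration, not Diffbot's actual implementation: it hashes each page's raw HTML and treats any repeated hash as a duplicate, which matches the "exact match" behavior but invents the helper names.

```python
import hashlib

def make_deduper():
    """Return a function that reports whether a page's raw HTML
    exactly matches any previously seen page."""
    seen = set()

    def is_duplicate(html: str) -> bool:
        # Hash the raw source; identical HTML yields an identical digest.
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen:
            return True
        seen.add(digest)
        return False

    return is_duplicate

dup = make_deduper()
dup("<html>page A</html>")  # False: first time this source is seen
dup("<html>page A</html>")  # True: exact-source duplicate, would be ignored
dup("<html>page B</html>")  # False: different source, would be processed
```

Note that this catches only byte-for-byte identical sources; pages that differ by even one character (a timestamp, a session ID) pass through and are handled by the `diffbotUri` mechanism described next.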
Duplicated extractions will have the same diffbotUri
Each Diffbot JSON object contains the diffbotUri field. Its value is calculated from a subset of the extracted fields and uniquely identifies the extracted content. The diffbotUri will be the same across duplicate extractions.
For URLs that are not exact-source duplicates (and thus are not ignored while crawling), but that result in the same extracted output, the diffbotUri values will be identical. When you process your crawl data, filtering out objects that share a diffbotUri allows you to retain only one example of each entity.
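That filtering step can be sketched as a simple first-seen-wins pass over the crawl output. The field name diffbotUri is from the source; the sample values and titles below are made up for illustration.

```python
def dedupe_by_diffbot_uri(objects):
    """Keep one object per diffbotUri value, preserving first-seen order.
    Objects without a diffbotUri are kept as-is."""
    seen = set()
    unique = []
    for obj in objects:
        uri = obj.get("diffbotUri")
        if uri is not None and uri in seen:
            continue  # duplicate extraction: same content, skip it
        if uri is not None:
            seen.add(uri)
        unique.append(obj)
    return unique

# Illustrative crawl output: two objects share a diffbotUri.
crawl_data = [
    {"diffbotUri": "article|3|111", "title": "Story A"},
    {"diffbotUri": "article|3|111", "title": "Story A (duplicate URL)"},
    {"diffbotUri": "article|3|222", "title": "Story B"},
]

dedupe_by_diffbot_uri(crawl_data)  # keeps "Story A" and "Story B" only
```

Because the first occurrence is kept, the retained object is the one extracted earliest in the crawl; if you prefer a different tiebreaker, sort the objects before deduplicating.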