Can I limit extraction to articles written before, after or between certain dates?

Not before the page is processed. Diffbot determines an article’s date only by processing the page; the date is then returned, normalized, in the date field of the Article or Discussion API. Until a page is processed, its date is unknown.

The same is true of other extracted fields. For example, Crawl cannot limit processing to products priced above $100, or to discussion threads that contain more than fifteen posts. These facts are not known until Diffbot has processed the page.
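
Because the date only exists after extraction, one workaround is to filter the results on your side once the pages have been processed. The following is a minimal sketch rather than an official recipe: it assumes each extracted record is a dictionary with an optional "date" string in an RFC 1123 style format (for example "Wed, 12 Jan 2022 00:00:00 GMT"), and both filter_by_date and the sample records are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def filter_by_date(articles, start, end):
    """Keep already-extracted articles whose date falls in [start, end)."""
    kept = []
    for article in articles:
        raw = article.get("date")
        if not raw:
            continue  # no date was extracted for this page, so skip it
        published = parsedate_to_datetime(raw)  # timezone-aware for "GMT" dates
        if start <= published < end:
            kept.append(article)
    return kept


# Hypothetical records standing in for already-extracted article results.
articles = [
    {"title": "A", "date": "Wed, 12 Jan 2022 00:00:00 GMT"},
    {"title": "B", "date": "Tue, 01 Mar 2022 00:00:00 GMT"},
    {"title": "C"},  # date could not be determined
]

january = filter_by_date(
    articles,
    datetime(2022, 1, 1, tzinfo=timezone.utc),
    datetime(2022, 2, 1, tzinfo=timezone.utc),
)
print([a["title"] for a in january])
```

Note that every page is still processed; the filter only decides which results you keep afterwards.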

An alternative is to query the Diffbot Knowledge Graph instead of extracting your own articles. A simple DQL query to retrieve all techcrunch.com articles published in January 2022 looks like this:

type:Article site:"techcrunch.com" date>"2022-01-01" date<"2022-01-31" sortBy:date
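
If you want to build this query programmatically, a small helper can assemble the string for any site and date range. This is a sketch under stated assumptions: build_dql is a hypothetical helper, YOUR_TOKEN is a placeholder, and the endpoint URL and parameter names (type, token, query, size) are assumptions that should be checked against the Knowledge Graph API documentation before use. The code only constructs the request URL and does not send it.

```python
from urllib.parse import urlencode


def build_dql(site, start, end):
    """Assemble the DQL query shown above for a given site and date range."""
    return f'type:Article site:"{site}" date>"{start}" date<"{end}" sortBy:date'


query = build_dql("techcrunch.com", "2022-01-01", "2022-01-31")

# Assumed endpoint and parameter names; verify them in the API documentation.
params = urlencode(
    {"type": "query", "token": "YOUR_TOKEN", "query": query, "size": 50}
)
url = f"https://kg.diffbot.com/kg/v3/dql?{params}"
print(url)
```

urlencode takes care of escaping the quotes, colons, and comparison operators in the query, which would otherwise break the URL.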