Can I limit processing to articles written before, after or between certain dates?

Generally: no.

This falls under the “no free lunch” reality of Diffbot services: Diffbot can determine an article’s date only by processing the page. (The date is returned, normalized, in the date field of the Article or Discussion API.) Until a page is processed, its date is unknown.

(This is true for other extracted fields as well. For example, Crawlbot cannot limit processing to products whose prices are greater than $100, or to discussion threads that contain more than fifteen posts. These facts are not known until Diffbot has processed the page.)

The Exception: Using HTML Processing Patterns

The rare exception arises when a site’s HTML uses consistent markup that allows HTML Processing Patterns to be used. HTML Processing Patterns restrict processing to those pages whose raw HTML source contains the exact string(s) provided.

For instance, if a web site consistently presents its article dates as follows…

<h6 class="dateline">Wednesday, 19 July 2017</h6>

…it would be possible to limit article processing by entering one HTML Processing Pattern per month. To restrict processing to articles written between April and August 2017 (inclusive), you could add the following:

April 2017</h6>
May 2017</h6>
June 2017</h6>
July 2017</h6>
August 2017</h6>
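
For longer date ranges, typing one pattern per month by hand is error-prone. The following is a minimal Python sketch that generates the pattern strings for an inclusive range of months. It assumes the example dateline markup shown above (month name, year, then a closing </h6> tag); the function name month_patterns and the suffix parameter are illustrative and not part of any Diffbot tool.

```python
# English month names are listed explicitly so the output does not depend on locale.
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

def month_patterns(start_year, start_month, end_year, end_month, suffix="</h6>"):
    """Return one 'Month YYYY</h6>'-style string per month in the inclusive range.

    Assumption: the site renders datelines as '... <Month> <YYYY></h6>',
    as in the example markup. Adjust `suffix` for other markup.
    """
    patterns = []
    year, month = start_year, start_month
    # Tuple comparison walks month by month until the end of the range.
    while (year, month) <= (end_year, end_month):
        patterns.append(f"{MONTHS[month - 1]} {year}{suffix}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return patterns

# April through August 2017, matching the list above.
for pattern in month_patterns(2017, 4, 2017, 8):
    print(pattern)
```

The printed strings can then be entered as HTML Processing Patterns. A range that crosses a year boundary, such as November 2016 through February 2017, is handled the same way.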

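To restrict processing to articles written before 2017, you could add one pattern for each earlier year you want to include, for example:

2016</h6>
2015</h6>
2014</h6>

…or, using the negative convention (leading exclamation point), a single pattern that excludes any page whose dateline ends in 2017:

!2017</h6>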

It’s important to ensure that your HTML Processing Patterns are exact matches to the source HTML, and that your chosen strings are unique to the pages you wish to process.
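
One way to check patterns before starting a crawl is to test them locally against the saved HTML source of a few representative pages. The sketch below is a rough approximation of the matching behavior described here, not Diffbot's implementation. It assumes that a page qualifies when it contains at least one positive pattern (if any are given) and none of the negative ones; the function name should_process and the way positive and negative patterns are combined are assumptions.

```python
def should_process(html, patterns):
    """Approximate whether a page's raw HTML passes a list of patterns.

    Assumed semantics (not confirmed by the source):
      - a leading '!' marks a negative pattern; a page containing it is skipped
      - otherwise the page must contain at least one positive pattern,
        unless no positive patterns were given
    Matching is a plain, case-sensitive substring test on the raw HTML.
    """
    positives = [p for p in patterns if not p.startswith("!")]
    negatives = [p[1:] for p in patterns if p.startswith("!")]
    if any(n in html for n in negatives):
        return False
    return not positives or any(p in html for p in positives)

sample = '<h6 class="dateline">Wednesday, 19 July 2017</h6>'

print(should_process(sample, ["July 2017</h6>"]))    # True: exact substring
print(should_process(sample, ["!2017</h6>"]))        # False: negative pattern hit
print(should_process(sample, ["July 2017 </h6>"]))   # False: stray space breaks the match
```

The last call shows why exactness matters: a single extra space is enough to make a pattern miss every page.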