How Diffbot handles multi-page articles and discussions

Diffbot’s Article and Discussion APIs support automatic page concatenation: the ability to string together multiple pages into a single response.

By default, the Article API automatically concatenates multi-page articles (up to twenty pages in total) into single ‘text’ and ‘html’ responses, and collects media items from all pages into the ‘images’ and ‘videos’ arrays.

To disable this functionality, pass paging=false in your Article API request.

The Discussion API will not concatenate by default. If you wish to enable concatenation, use the maxPages argument to define the maximum number of pages you wish to be returned in a response. Use maxPages=all to return all pages regardless of length.
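
As a rough illustration of the two parameters above, the following sketch builds request URLs without sending them. The endpoint base, the token and url parameter names, and the example.com pages are assumptions for illustration; only paging and maxPages come from this article.

```python
from urllib.parse import urlencode

# Assumed endpoint base; confirm against your own Diffbot documentation.
API_BASE = "https://api.diffbot.com/v3"


def build_request_url(api, token, page_url, **options):
    """Return a request URL for the given API ("article" or "discussion")."""
    # "token" and "url" are assumed parameter names, not taken from this article.
    params = {"token": token, "url": page_url}
    for name, value in options.items():
        # Send booleans as lowercase strings, e.g. paging=false.
        if isinstance(value, bool):
            value = "true" if value else "false"
        params[name] = value
    return f"{API_BASE}/{api}?{urlencode(params)}"


# Article API with concatenation disabled.
article_url = build_request_url(
    "article", "YOUR_TOKEN", "https://example.com/story", paging=False
)

# Discussion API returning at most five pages; pass maxPages="all" for every page.
discussion_url = build_request_url(
    "discussion", "YOUR_TOKEN", "https://example.com/thread", maxPages=5
)
```

The page URL is percent-encoded by urlencode, so it can safely contain its own query string.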

When an article or discussion thread has had multiple pages concatenated, you will see two additional fields in your default response:

  • numPages: total number of pages concatenated to form the full output
  • nextPages: a list of additional URLs that were extracted
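
A minimal sketch of reading these two fields from a parsed JSON response. It assumes the extracted item is the first entry of a top-level "objects" list and that both fields are simply absent when nothing was concatenated; the sample dictionary is hand-written for illustration, not real API output.

```python
def pagination_summary(response):
    """Return (numPages, nextPages) for the first object in a parsed response."""
    # Assumption: extracted items live in a top-level "objects" list.
    objects = response.get("objects") or [{}]
    first = objects[0]
    # Assumption: missing fields mean a single page with no extra URLs.
    return first.get("numPages", 1), first.get("nextPages", [])


# Hand-written sample, not real API output.
sample = {
    "objects": [
        {
            "title": "Example",
            "numPages": 3,
            "nextPages": [
                "https://example.com/story?page=2",
                "https://example.com/story?page=3",
            ],
        }
    ]
}

num_pages, next_pages = pagination_summary(sample)
```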

Pagination not working as expected?

On occasion, a site’s unique pagination design or terminology will confuse our concatenator. In this case, you can add concatenation for a particular article or discussion page using our Custom API. To set one up:

  1. Create a new Custom API for your page.
  2. Create a new custom field named nextPage.
  3. Select the element that contains the link to the next page.
  4. Add an “attribute” filter using the Filters drop-down, and in this field enter href to make sure the URL value is returned.

A few notes:

  • In some JavaScript-based pagination, this URL value may not be available or may be available in a different attribute. If it is not available, you will not be able to create an override.
  • This method only works for the Article and Discussion APIs.

Sometimes sites don’t identify the next page link using unique CSS selectors (particularly on sites that have links to individually numbered pages).

For instance, an older layout of Slate.com used the same class — .sl-art-pag-link — for all links to individual pages, even pages prior to the current page. Using this class alone could result in multiple nextPage values and an infinite processing loop.

Our concatenation algorithm will generally prevent infinite loops and repeated content, but writing better CSS selectors will ensure the best performance. In this case, using the following selector will ensure that only the correct next page is identified:

.sl-art-curpage + .sl-art-pag-link

This uses the adjacent sibling combinator (the plus sign) to match only the page link that immediately follows the current-page element (.sl-art-curpage), so only the next page, if it exists, is identified.
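
To make the selector's behavior concrete, here is a small standard-library sketch that mimics what .sl-art-curpage + .sl-art-pag-link matches and returns the href value, as the attribute filter would. It is not how Diffbot evaluates selectors. The pagination markup is hypothetical (the original Slate layout is not reproduced here), and the parser assumes well-formed HTML with every non-void tag closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they open no nesting level.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class NextPageFinder(HTMLParser):
    """Collect hrefs of elements matching .sl-art-curpage + .sl-art-pag-link."""

    def __init__(self):
        super().__init__()
        # One entry per open element: the class set of its most recent child.
        self.last_child_classes = [set()]
        self.matches = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = set((attrs.get("class") or "").split())
        previous_sibling = self.last_child_classes[-1]
        # Match only when the immediately preceding sibling is the current page.
        if "sl-art-curpage" in previous_sibling and "sl-art-pag-link" in classes:
            self.matches.append(attrs.get("href"))
        self.last_child_classes[-1] = classes
        if tag not in VOID_TAGS:
            self.last_child_classes.append(set())

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.last_child_classes) > 1:
            self.last_child_classes.pop()


def find_next_page(html):
    finder = NextPageFinder()
    finder.feed(html)
    return finder.matches


# Hypothetical pagination markup with page 2 as the current page.
SAMPLE = """
<div class="pagination">
  <a class="sl-art-pag-link" href="/story?page=1">1</a>
  <span class="sl-art-curpage">2</span>
  <a class="sl-art-pag-link" href="/story?page=3">3</a>
  <a class="sl-art-pag-link" href="/story?page=4">4</a>
</div>
"""

next_links = find_next_page(SAMPLE)
```

With the class .sl-art-pag-link alone, all three links in this sample would match; the sibling check narrows the result to the single link after the current page, and to nothing on the last page.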