Custom API Rulesets

A set of rules and parameters defining what a Custom API actually extracts.

Every instance of a Custom API is defined by a JSON ruleset object, which includes a rules array of rule objects, the api the ruleset applies to, and a urlPattern matching the URLs to be extracted with this API.

A simple ruleset object looks like this.

{
  "rules": [
    {
      "name": "Description",
      "selector": ".entry-content p"
    }
  ],
  "api": "/api/list",
  "urlPattern": "(http(s)?://)?(.*\\.)?blog.diffbot.com.*",
  "testUrl": "https://blog.diffbot.com/knowledge-graph-glossary/"
}

In this ruleset, the List API is extended to also extract a Description field for URLs matching the urlPattern.
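
Before saving a ruleset, it can help to confirm that the urlPattern really matches the testUrl. The following is a minimal sketch in Python, not part of the Custom API itself. It assumes the pattern must match the whole URL, and it uses Python's re module, whereas Diffbot's regular expressions are Java-based; for a pattern this simple the two should agree. The helper name url_matches is hypothetical.

```python
import re

# Values copied from the ruleset above. In JSON the pattern is written with
# a doubled backslash ("\\."); the regex engine sees a single one ("\.").
URL_PATTERN = r"(http(s)?://)?(.*\.)?blog.diffbot.com.*"
TEST_URL = "https://blog.diffbot.com/knowledge-graph-glossary/"


def url_matches(pattern: str, url: str) -> bool:
    """Return True if the whole URL matches the pattern (assumed semantics)."""
    return re.fullmatch(pattern, url) is not None


print(url_matches(URL_PATTERN, TEST_URL))
print(url_matches(URL_PATTERN, "https://www.example.com/"))
```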

A complete Custom API ruleset contains the following fields. Fields marked Optional may be omitted.

Field | Description
urlPattern | Regular expression used to match URLs to the appropriate rule.
api | Diffbot API against which the ruleset should be applied. The api value should include the /api/ string, e.g. /api/article.
rules | An array of rules applying to individual fields of the Diffbot API. The rules array can be empty (rules=[]). More on rules.
  name | Field to correct (e.g., title) or add (e.g., customField).
  selector | CSS selector to find the appropriate content on the page.
  value | Optional: a specific value to hard-code, in lieu of a selector.
  filters | Optional: additional options to replace content, ignore selectors, or extract HTML attribute values. See below.
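
The top-level requirements in this table can be checked mechanically before a ruleset is submitted. The sketch below is only an illustration: check_ruleset is a hypothetical helper, and it covers just the points stated above (the three top-level fields, the /api/ prefix, and rules being an array).

```python
REQUIRED_FIELDS = ("urlPattern", "api", "rules")


def check_ruleset(ruleset: dict) -> list:
    """Return a list of problems; an empty list means the basic shape is fine."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS
                if name not in ruleset]
    if "api" in ruleset and not str(ruleset["api"]).startswith("/api/"):
        problems.append("api should include the /api/ string, e.g. /api/article")
    if "rules" in ruleset and not isinstance(ruleset["rules"], list):
        problems.append("rules should be an array (it may be empty)")
    return problems


# An empty rules array is allowed, so this ruleset passes.
print(check_ruleset({"urlPattern": ".*", "api": "/api/list", "rules": []}))
# This one is missing two fields and has an api value without the prefix.
print(check_ruleset({"api": "article"}))
```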

Custom API rulesets may also include these optional parameters.

Field | Description
testUrl | Optional: A sample URL used to preview your rule within the Custom API Toolkit in the Dashboard.
prefilters | Optional: An array of selectors that should be completely dropped from the DOM. These selectors will be fully ignored by all Diffbot processing.
renderOptions | Optional: Querystring arguments to be passed to the Diffbot rendering engine, e.g. wait=5000. More on renderOptions.
xForwardHeaders | Optional: An object containing any custom headers to be passed along in all requests to URLs matching the urlPattern. Header values can either be a single string, or an array of strings (from which one will be selected at request-time). Custom headers can include:
  User-Agent | Optional: User agent to use in place of Diffbot default.
  Referrer | Optional: Custom referrer to use in place of Diffbot default.
  Cookie | Optional: Custom cookie content to be sent with all requests.
  Accept-Language | Optional: Custom accept-language header to be sent.
  X-Evaluate | Optional: Custom JavaScript to be executed at render-time.
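
To make the "single string or array of strings" behaviour of xForwardHeaders concrete, here is a small sketch of how one value per header might be chosen for a request. It assumes the choice is random, which the text above does not state; resolve_headers and the header values are hypothetical placeholders.

```python
import random

# Shape of an xForwardHeaders object: one header with an array of candidate
# values and one with a single string. The values are placeholders.
x_forward_headers = {
    "User-Agent": ["ExampleAgent/1.0", "ExampleAgent/2.0"],
    "Accept-Language": "en-US",
}


def resolve_headers(headers: dict, rng: random.Random) -> dict:
    """Pick one concrete value per header (random choice is an assumption)."""
    resolved = {}
    for name, value in headers.items():
        resolved[name] = rng.choice(value) if isinstance(value, list) else value
    return resolved


print(resolve_headers(x_forward_headers, random.Random()))
```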

Defining a Rule

To recap, a single Custom API instance is defined by a JSON ruleset object. This ruleset object contains an array of rule objects as well as the parameters listed above.

In this section, we look at what defines a single rule object that lives within the rules field of a complete Custom API ruleset.

Here's an example of a simple rule object:

{
	"selector": ".entry-content p",
	"name": "text"
}

A Custom API with this rule will:

  1. Look for a DOM element corresponding to the CSS selector .entry-content p
  2. Extract the text content of that element
  3. Return it in the response of the Custom API under the field named text

Custom API rules can be used to "correct" individual fields of an Extract API

To correct a field that isn't extracting automatically, define a custom rule using the same name as the incorrectly extracted field.

Experience with CSS selectors will be very helpful in defining Custom API rules. A reference of all supported selectors and operators is available here.

Should multiple elements match a selector, the text contents of all matching elements will be concatenated into a single string in the output value.

A rule may also extract the value of an attribute on the selected element. To do this, we can use a rule filter.

Using Rule Filters

filters may be used in a Custom API rule to get an attribute value of an element, replace content extracted, or exclude certain sections of content.

Here's an example of a rule filter that extracts the src value of all img elements.

{
  "selector": "img",
  "name": "url",
  "filters": [
    {
      "args": [
        "src"
      ],
      "type": "attribute"
    }
  ]
}

A filter object is constructed with an args and a type field.

  • type specifies a filter type to be used (attribute, exclude, or replace)
  • args is an array of arguments to be provided to the filter

A rule may contain multiple filters, hence its representation in a rule as a JSON array.
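
When rules are generated programmatically, a small helper can keep filter objects in the shape just described. This is a sketch only: make_filter is a hypothetical name, and the set of accepted types is taken from the list above.

```python
KNOWN_FILTER_TYPES = ("attribute", "exclude", "replace")


def make_filter(filter_type: str, *args: str) -> dict:
    """Build a filter object with the type and args fields described above."""
    if filter_type not in KNOWN_FILTER_TYPES:
        raise ValueError(f"unknown filter type: {filter_type!r}")
    return {"type": filter_type, "args": list(args)}


# Rebuilds the img/src rule shown earlier in this section.
rule = {
    "selector": "img",
    "name": "url",
    "filters": [make_filter("attribute", "src")],
}
print(rule)
```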

More details on the use of each available filter are given below.

Filter Type: attribute

Retrieves the value of the attribute named in args from the selected element.

For example, to extract the link http://blog.diffbot.com from the anchor tag <a href="http://blog.diffbot.com" class="outbound">, we may use the following rule:

{
  "selector": "a.outbound",
  "name": "link",
  "filters": [
    {
      "args": [
        "href"
      ],
      "type": "attribute"
    }
  ]
}

Filter Type: exclude

Ignores selectors (and all descendants) supplied in args if they are found within the CSS selector of the parent rule.
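
No example of an exclude rule is given here, so the following sketch shows what one might look like and what effect it is meant to have. The .share selector and the sample markup are hypothetical, and text_without only imitates the described behaviour for a single class name using Python's standard library; it is not how Diffbot implements the filter.

```python
import xml.etree.ElementTree as ET

# A rule that extracts .entry-content but ignores any .share element inside it.
rule = {
    "selector": ".entry-content",
    "name": "text",
    "filters": [{"type": "exclude", "args": [".share"]}],
}

SAMPLE = ('<div class="entry-content"><p>Body text.</p>'
          '<div class="share">Share this</div></div>')


def text_without(element, excluded_class: str) -> str:
    """Collect an element's text, skipping descendants with the excluded class."""
    if element.get("class") == excluded_class:
        return ""
    text = element.text or ""
    for child in element:
        text += text_without(child, excluded_class) + (child.tail or "")
    return text


print(text_without(ET.fromstring(SAMPLE), "share"))
```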

Filter Type: replace

Use regular expression syntax to extract only specific sections of text from the original extraction output. Supply your regular expression as the first element of the args array and the replacement (typically a regex group reference such as $1) as the second.

For example, this is how you would extract just the numerical price (12.99) from a pricing element (.offerPrice) that extracts as "$12.99" by default.

{
  "selector": ".offerPrice",
  "name": "price",
  "filters": [
    {
      "args": [
        "^\\$(.*)$",
        "$1"
      ],
      "type": "replace"
    }
  ]
}

Back references are also supported. For example, you can prepend text to the extracted value by using the regular expression (^.*$) with the replacement string "prefix: $1".
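
The two replace examples above can be tried out locally. The sketch below imitates the filter with Python's re module, translating Java-style group references ($1) into Python's form; apply_replace is a hypothetical helper, and Diffbot's Java-based engine may differ from Python on more complex patterns.

```python
import re


def apply_replace(text: str, pattern: str, replacement: str) -> str:
    """Imitate a replace filter: args[0] is the pattern, args[1] the replacement."""
    # Java writes group references as $1; Python's re.sub expects \g<1>.
    py_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, py_replacement, text)


# In JSON the first pattern is written "^\\$(.*)$"; the regex itself is ^\$(.*)$.
print(apply_replace("$12.99", r"^\$(.*)$", "$1"))
print(apply_replace("Hello", r"(^.*$)", "prefix: $1"))
```

The helper does not handle literal backslashes or escaped dollar signs in the replacement string, which is enough for the two cases shown.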

Diffbot uses a Java implementation for its regular expression parsing. Regular-Expressions.info offers an excellent overview of language-specific distinctions.

Extracting Multiple Elements into a List

If a CSS selector matches multiple elements on a page, the text values of all the matched elements will be concatenated into a single output value for the field.

To structure the output into an array instead, we can nest rules within rules. We call this a collection.

This is an example of a collection and the HTML structure it will extract.

<div class="img-thumbnail">
  <img src="img-1.png" />
  <span class="img-caption">Image #1's caption.</span>
</div>
<div class="img-thumbnail">
  <img src="img-2.png" />
  <span class="img-caption">Image #2's caption.</span>
</div>

{
  "selector": ".img-thumbnail",
  "name": "images",
  "rules": [
    {
      "selector": "img",
      "name": "url",
      "filters": [
        {
          "args": [
            "src"
          ],
          "type": "attribute"
        }
      ]
    }
  ]
}

We start by defining the largest parent element enclosing the repeating elements (.img-thumbnail). We then define a nested rules object that extracts the src attribute of every img element inside the repeating parent element.

Notice that each .img-thumbnail element also encloses a caption. We can extract that caption alongside the src of each image by adding another rule at the same nesting level as the src extraction rule.

{
  "selector": ".img-thumbnail",
  "name": "images",
  "rules": [
    {
      "selector": "img",
      "name": "url",
      "filters": [
        {
          "args": [
            "src"
          ],
          "type": "attribute"
        }
      ]
    },
    {
      "selector": "span.img-caption",
      "name": "caption"
    }
  ]
}
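
To illustrate the kind of array a collection produces, the sketch below walks a copy of the sample markup (with distinct captions) using Python's standard library and builds one object per .img-thumbnail element. It only mimics the nesting logic of the rule; the exact layout of the Custom API's response is not shown in this document, so the printed structure is an assumption.

```python
import xml.etree.ElementTree as ET

HTML = """
<div class="img-thumbnail">
  <img src="img-1.png" />
  <span class="img-caption">Image #1's caption.</span>
</div>
<div class="img-thumbnail">
  <img src="img-2.png" />
  <span class="img-caption">Image #2's caption.</span>
</div>
"""

# Wrap the fragment in a single root element so it parses as XML.
root = ET.fromstring(f"<body>{HTML}</body>")

images = []
for thumbnail in root.findall(".//div[@class='img-thumbnail']"):  # parent rule
    images.append({
        # nested rule "url": attribute filter on src
        "url": thumbnail.find("img").get("src"),
        # nested rule "caption": text content of the span
        "caption": thumbnail.find("span[@class='img-caption']").text,
    })

print(images)
```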

Deleting fields

If you do not want a particular field to appear in the output JSON, you can remove it with a rule. The rule below ensures that images will not appear in the output. Notice that delete is set to the JSON boolean true, i.e. without quotes.

{
  "rules": [
    {
      "name": "images",
      "delete": true
    }
  ]
}
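
If several fields need to be removed, the delete rules can be generated rather than written by hand. A minimal sketch follows; delete_rules is a hypothetical helper. Serializing with Python's json module turns True into the unquoted JSON true that the rule requires.

```python
import json


def delete_rules(*field_names: str) -> list:
    """Build one delete rule per field name."""
    return [{"name": name, "delete": True} for name in field_names]


print(json.dumps({"rules": delete_rules("images")}, indent=2))
```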

Forcing extraction from a particular section of the page for the List API

You can force list extraction from specific nodes by supplying their XPaths as the selector of a rule for the items field. To specify multiple containers, separate the XPaths with a pipe (|). The List API treats each container separately. So that you can tell which list item came from which user-defined container, each resulting listing contains an extra key, containerXpath. For example:

{
    "rules": [
        {
            "name": "items",
            "selector": "/html/body/div[1]/div/div[1]/div/div/section[4]/div/div | /html/body/div[1]/div/div[1]/div/div/section[5]/div/div/div/div/div/div/div/div/article/div/div[2]/div[2]/table"
        }
    ],
    "api": "/api/list"
}

The XPaths of the two container nodes would be:

  1. /html/body/div[1]/div/div[1]/div/div/section[4]/div/div
  2. /html/body/div[1]/div/div[1]/div/div/section[5]/div/div/div/div/div/div/div/div/article/div/div[2]/div[2]/table
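
The pipe-separated selector can be assembled from a list of XPaths, and the containerXpath key can then be used to split the results by container. The sketch below assumes each returned item is a dictionary carrying that key; items_rule, group_by_container, and the sample items are hypothetical.

```python
from collections import defaultdict

CONTAINERS = [
    "/html/body/div[1]/div/div[1]/div/div/section[4]/div/div",
    "/html/body/div[1]/div/div[1]/div/div/section[5]/div/div/div/div/div"
    "/div/div/div/article/div/div[2]/div[2]/table",
]


def items_rule(xpaths: list) -> dict:
    """Build the items rule, joining the container XPaths with a pipe."""
    return {"name": "items", "selector": " | ".join(xpaths)}


def group_by_container(items: list) -> dict:
    """Group list items by the containerXpath key described above."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item.get("containerXpath")].append(item)
    return dict(grouped)


ruleset = {"rules": [items_rule(CONTAINERS)], "api": "/api/list"}

# Made-up items standing in for a List API response.
sample_items = [
    {"title": "A", "containerXpath": CONTAINERS[0]},
    {"title": "B", "containerXpath": CONTAINERS[1]},
    {"title": "C", "containerXpath": CONTAINERS[0]},
]
print({xpath: len(items) for xpath, items in group_by_container(sample_items).items()})
```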