Which regular expression standard / syntax does Crawlbot use?

If you wish to explicitly control which pages Crawlbot crawls and/or processes, you can optionally supply one or more regular expressions. Each page URL is evaluated against these regexes, and matching URLs are spidered or processed.
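As a minimal sketch of this evaluation, the example below tests URLs against a hypothetical crawl pattern (the pattern and URLs are invented for illustration; the actual Crawlbot API parameters are not shown here):

```python
import re

# Hypothetical pattern: only follow product pages under /store/.
# A URL that matches is spidered/processed; one that does not is skipped.
pattern = re.compile(r"/store/product/\d+")

urls = [
    "https://example.com/store/product/12345",    # matches
    "https://example.com/blog/2016/announcement", # no match
]

matches = [bool(pattern.search(url)) for url in urls]
print(matches)  # [True, False]
```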

There are a number of regular expression implementations, each differing slightly in syntax. Crawlbot does not use any one of these existing implementations; instead, it uses a custom regular expression engine to ensure the best possible performance while evaluating pages.

Character classes are the regex construct most commonly used in Crawlbot URL patterns. Crawlbot supports all of the ASCII classes in the following table, as well as most Perl/Tcl shortcuts:

| Perl/Tcl | ASCII | Description |
| --- | --- | --- |
| | [A-Za-z0-9] | Alphanumeric characters |
| \w | [A-Za-z0-9_] | Alphanumeric characters plus "_" |
| \W | [^A-Za-z0-9_] | Non-word characters |
| | [A-Za-z] | Alphabetic characters |
| | [ \t] | Space and tab |
| \b | (?<=\W)(?=\w)\|(?<=\w)(?=\W) | Word boundaries |
| | [\x00-\x1F\x7F] | Control characters |
| \d | [0-9] | Digits |
| \D | [^0-9] | Non-digits |
| | [\x21-\x7E] | Visible characters |
| | [a-z] | Lowercase letters |
| | [\x20-\x7E] | Visible characters and the space character |
| | [][!"#$%&'()*+,./:;<=>?@\^_`{\|}~-] | Punctuation characters |
| \s | [ \t\r\n\v\f] | Whitespace characters |
| \S | [^ \t\r\n\v\f] | Non-whitespace characters |
| | [A-Z] | Uppercase letters |
| | [A-Fa-f0-9] | Hexadecimal digits |
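As a quick sanity check of the shortcut-to-class equivalences above, the sketch below compares each Perl/Tcl shortcut against its ASCII class over the full ASCII range, using Python's re module in ASCII mode (an assumption for illustration; Crawlbot's own engine is not being exercised here):

```python
import re

def ascii_set(pattern):
    """Return the set of ASCII characters (0-127) matched by `pattern`,
    with shortcuts like \\d and \\s restricted to ASCII semantics."""
    return {chr(c) for c in range(128)
            if re.fullmatch(pattern, chr(c), re.ASCII)}

# Each shortcut matches exactly the characters of its ASCII class.
assert ascii_set(r"\d") == ascii_set(r"[0-9]")
assert ascii_set(r"\w") == ascii_set(r"[A-Za-z0-9_]")
assert ascii_set(r"\s") == ascii_set(r"[ \t\r\n\v\f]")
```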