How do I stop a “never-ending” crawl due to dynamic URLs or querystrings?

On rare occasions Crawl will encounter a site that creates links dynamically, and seemingly ad infinitum, either because of a programming error or simply because of a very large number of dynamic parameter permutations.

In this case, you will find Crawl continuing to crawl a seemingly never-ending number of dynamically-created pages. Often these manifest as search-result queries or filters, as in the following URLs:

http://www.diffbotfashion.com/pants?color=black&waist=32&inseam=30&type=chino&fabric=cotton
http://www.diffbotfashion.com/pants?color=black&waist=33&inseam=30&type=chino&fabric=cotton
http://www.diffbotfashion.com/pants?color=black&waist=34&inseam=30&type=chino&fabric=cotton
http://www.diffbotfashion.com/pants?color=black&waist=35&inseam=30&type=chino&fabric=cotton

The way to address this issue is twofold:

Step 1: Download the URL Report

Download your crawl’s URL Report to help determine which patterns are being needlessly repeated. It’s recommended that you download the “Last 500 URLs” rather than the full report to save download time (and your hard drive space), as these files — particularly for never-ending crawls — can be quite large.

Once you have the URL Report, you can quickly see which URL patterns are being repeated.
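
If the report is large, a short script can make the repeated patterns easier to spot. The sketch below is illustrative only: it assumes the report has been saved locally as url_report.csv with a column named "url" (adjust both to match your actual file), and it tallies each path and query-parameter name so that value permutations collapse into a single count.

import csv
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Tally path + query-parameter-name combinations from the URL Report.
# The filename and column name are assumptions; match them to your download.
counts = Counter()
with open("url_report.csv", newline="") as f:
    for row in csv.DictReader(f):
        parsed = urlparse(row["url"])
        for param in parse_qs(parsed.query):
            # Ignore parameter values so waist=32, waist=33, etc. count together.
            counts[f"{parsed.path}?{param}="] += 1

for pattern, count in counts.most_common(10):
    print(f"{count:6d}  {pattern}")

The most frequent entries in the output are good candidates for the negative patterns you will add in Step 2.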

Step 2: Add Negative Crawling (and Processing, if Necessary) Patterns

Then, identify the patterns you wish to exclude and add them as negative crawl patterns (prepend each term/pattern with an exclamation point). Once you’ve done so, any URLs containing these terms will no longer be crawled.
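
For the example URLs above, negative crawl patterns like the following (illustrative only; use terms that match the repeated patterns in your own report) would stop those filter permutations from being crawled:

!pants?color=
!waist=
!inseam=

Because matching is a simple "contains" check on the URL, a single distinctive term such as !waist= is usually enough; just make sure the term does not also appear in URLs you still want crawled.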

If your crawl is set to process all pages (typically using the Analyze API), it's a good idea to duplicate these negative patterns as processing patterns as well, to prevent errant processing of unneeded duplicate pages.

It may take a round or two of checking the URL Report to fully exclude all of the dynamic links. Once you have done so, your crawl should finish its round within a few minutes.