Does Diffbot extract non-English pages?

Yes. Sí. Oui. Ja. نعم. はい. HIja’.

Because Diffbot Extract APIs rely on computer vision, they generally do well at identifying the same kinds of page elements in most languages. We also train our algorithms on data in multiple languages, so that they better handle linguistic and other differences.

The returned data is in the page's original language. That language is reported in the humanLanguage field as a two-letter ISO 639-1 code, with two exceptions: Simplified Chinese (zh-cn) and Taiwanese Mandarin (zh-tw).
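
As a rough illustration of reading that field, here is a minimal Python sketch. It assumes the response is JSON with a top-level "objects" list whose entries carry humanLanguage; the sample body and the human_languages helper are hypothetical and heavily trimmed, so check a real response for the exact shape.

```python
import json

# Hypothetical, trimmed response body used only for illustration.
# Assumption: extracted items sit in a top-level "objects" list.
SAMPLE_RESPONSE = """
{
  "objects": [
    {"type": "article", "title": "Ejemplo", "humanLanguage": "es"},
    {"type": "article", "title": "Example", "humanLanguage": "zh-cn"}
  ]
}
"""


def human_languages(response_text):
    """Return the humanLanguage value of each object, or None where it is missing."""
    payload = json.loads(response_text)
    return [obj.get("humanLanguage") for obj in payload.get("objects", [])]


print(human_languages(SAMPLE_RESPONSE))
```

Using .get() rather than direct indexing keeps the sketch from failing on objects that do not include the field.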

The currently supported and returned ISO codes are as follows:

  • ar
  • az
  • bg
  • bn
  • ca
  • cs
  • da
  • de
  • el
  • en
  • es
  • et
  • fa
  • fi
  • fr
  • gu
  • he
  • hi
  • hr
  • hu
  • id
  • it
  • ja
  • ko
  • lt
  • lv
  • mk
  • ml
  • nl
  • no
  • pa
  • pl
  • pt
  • ro
  • ru
  • si
  • sq
  • sv
  • ta
  • te
  • th
  • tl
  • tr
  • uk
  • ur
  • vi
  • zh-cn
  • zh-tw
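
If you want to check a returned code against this list in your own code, something like the following Python sketch may help. The SUPPORTED_LANGUAGES and is_supported names are hypothetical, the set is copied from the list above and will go stale if the list changes, and the lowercasing step is an assumption rather than documented behavior.

```python
# Codes copied from the list above; update this set if the list changes.
SUPPORTED_LANGUAGES = frozenset(
    "ar az bg bn ca cs da de el en es et fa fi fr gu he hi hr hu id it "
    "ja ko lt lv mk ml nl no pa pl pt ro ru si sq sv ta te th tl tr uk "
    "ur vi zh-cn zh-tw".split()
)


def is_supported(code):
    """Return True if code appears in the list above.

    Assumption: comparison is case-insensitive and ignores surrounding
    whitespace; the API's own casing is not specified here.
    """
    return code.strip().lower() in SUPPORTED_LANGUAGES


print(is_supported("zh-cn"))
print(is_supported("zh"))
```

Note that the two Chinese variants are matched as whole codes, so a bare "zh" is not treated as supported.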