...

(Optional, Boolean) Whether to skip sitemap detection and resolution for the processed URLs. Defaults to false.

skipParsing

(Optional, Boolean) Whether to skip parsing of crawled documents. If true, the documents are uploaded unparsed to the binary data service; if false, they are parsed and their text is added to the records instead. For a more detailed explanation, see the Parsing of crawled documents section. Defaults to false.

maxCrawlers

(Optional, Integer) Number of crawlers to handle the URLs. Defaults to 1.
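
As an illustration of how the two optional values above and their defaults fit together, here is a minimal Python sketch. Only the names skipParsing and maxCrawlers, their types, and their defaults come from this page. The helper build_crawl_options, the flat dictionary layout, and the lower bound of 1 on maxCrawlers are assumptions made for the example, not the connector's actual configuration format.

```python
import json

# Defaults as stated in the parameter descriptions above.
DEFAULTS = {"skipParsing": False, "maxCrawlers": 1}


def build_crawl_options(skip_parsing=None, max_crawlers=None):
    """Hypothetical helper: return both optional values, using the
    documented defaults for anything that is not supplied."""
    options = dict(DEFAULTS)

    if skip_parsing is not None:
        if not isinstance(skip_parsing, bool):
            raise TypeError("skipParsing must be a Boolean")
        options["skipParsing"] = skip_parsing

    if max_crawlers is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(max_crawlers, bool) or not isinstance(max_crawlers, int):
            raise TypeError("maxCrawlers must be an Integer")
        # Assumption: at least one crawler is needed; the page does not
        # state a lower bound.
        if max_crawlers < 1:
            raise ValueError("maxCrawlers must be at least 1")
        options["maxCrawlers"] = max_crawlers

    return options


if __name__ == "__main__":
    # Keep original files in the binary data service and use four crawlers.
    print(json.dumps(build_crawl_options(skip_parsing=True, max_crawlers=4), indent=2))
```

Calling build_crawl_options() with no arguments yields the documented defaults, which corresponds to leaving both values out of the configuration.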

...

The minimum schedule window for the Website connector is 5 seconds, so at best it can run once every 5 seconds. With a shorter window, it will throw an error because an execution is already running.
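
The constraint above can be checked before a schedule is submitted. The following Python sketch assumes the schedule is expressed as an interval in seconds; the function name and that representation are hypothetical, and only the 5-second minimum comes from this page.

```python
# Minimum schedule window for the Website connector, as stated above.
MIN_SCHEDULE_WINDOW_SECONDS = 5


def is_valid_schedule_interval(interval_seconds):
    """Hypothetical pre-check: True if the interval respects the minimum window."""
    return interval_seconds >= MIN_SCHEDULE_WINDOW_SECONDS


if __name__ == "__main__":
    for interval in (2, 5, 60):
        print(interval, is_valid_schedule_interval(interval))
```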

Parsing of crawled documents

By default, Norconex, the library used for crawling, attempts to parse every downloaded file and extract its text and metadata. For the vast majority of file types, Norconex uses an Apache Tika parser, so the records put in the pipeline contain Tika's output. More information about how different file types are handled can be found here.

This default parsing behaviour can be disabled with the optional skipParsing configuration value. If it is set to true, the connector neither parses crawled files nor places their contents in the records added to the pipeline. Instead, the files are saved in the binary data service (BDS) exactly as they were downloaded. These uploaded files can be accessed in subsequent pipeline stages through the binaryContent field added to each record, which contains the key to retrieve the original file from the BDS.
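
To make the last step concrete, the sketch below shows how a later pipeline stage might resolve the binaryContent key. It is an illustration under stated assumptions: the BDS is stood in for by a plain Python dictionary, and the record layout, the key value, and the function load_original_file are hypothetical. Only the binaryContent field name and its role as the retrieval key come from this page.

```python
def load_original_file(record, bds):
    """Return the original downloaded bytes for a record, or None if the
    record carries no binaryContent key (for example, because it was
    parsed instead of uploaded)."""
    key = record.get("binaryContent")
    if key is None:
        return None
    # Assumption: the BDS behaves like a key-value store. A real client
    # would be used here instead of a dictionary lookup.
    return bds[key]


if __name__ == "__main__":
    # Stand-in for the binary data service, keyed by a hypothetical key.
    fake_bds = {"file-123": b"%PDF-1.7 example bytes"}

    unparsed_record = {"url": "https://example.com/report.pdf", "binaryContent": "file-123"}
    parsed_record = {"url": "https://example.com/index.html", "content": "Example text"}

    print(load_original_file(unparsed_record, fake_bds))
    print(load_original_file(parsed_record, fake_bds))
```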