
Based on the Norconex HTTP Collector, the Website connector crawls websites to fetch content and to discover links and page structure, starting from one or more start URLs.

Configuration

Example configuration in a seed:

{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "userAgent": "pureinsights-website-connector"
  },
  "name": "Website Crawler",
  "id": "5f40c3b9-901c-434d-a816-046464943016",
  "type": "website-connector",
  "pipelineId": "9901132e-5b03-415e-9131-5443282dd0dd",
  "timestamp": "2021-09-24T13:33:05.941392800Z"
}

Configuration parameters:

crawlUrls

(Required, Array of Strings) A list of start URLs that need to be crawled. The URLs can be from different domains. Robots.txt and sitemaps will be used and respected if present.

userAgent

(Optional, String) Identifies the connector's requests to the target websites. It is recommended to set it to something recognizable, so that website owners can identify traffic from the connector and possibly avoid blocking or throttling it. Defaults to pureinsights-website-connector.

maxDepth

(Optional, Integer) If set, limits how many levels of links are followed, counting from a start URL. Default is unlimited.

ignoreSiteMap

(Optional, Boolean) Whether to skip sitemap detection and resolution for the URLs being processed. Default is false.

maxCrawlers

(Optional, Integer) Number of crawlers to handle the URLs. Default is 1.

maxThreadsPerCrawler

(Optional, Integer) Number of execution threads for a crawler. Default is 2.
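The crawl-scope and concurrency parameters above can be combined in one seed. The following is a minimal sketch, assuming these parameters sit inside the seed object next to crawlUrls, as in the example configuration; the values are illustrative only, not recommendations:

```json
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "userAgent": "pureinsights-website-connector",
    "maxDepth": 3,
    "ignoreSiteMap": true,
    "maxCrawlers": 2,
    "maxThreadsPerCrawler": 4
  }
}
```

With these values the crawl would stop following links after three levels, skip sitemap detection, and use two crawlers with four threads each.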

metadataFilterField

(Optional, String) The name of the metadata field to filter the documents.

metadataFilterValue

(Optional, String) Comma-separated list of metadata values that allow the document to be fetched.
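As an illustration of how the two filter parameters might be used together, here is a hedged sketch. The field name category and the values news and blog are hypothetical, and the placement inside the seed object is assumed from the example configuration:

```json
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "metadataFilterField": "category",
    "metadataFilterValue": "news,blog"
  }
}
```

Under these assumptions, only documents whose category metadata field matches one of the listed values would be fetched.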

http.authUserName

(Optional, String) The username to be used for authentication.

http.authPassword

(Optional, String) The password to be used for authentication.

http.authUsernameField

(Optional, String) When using form-based authentication, the name of the form field that holds the username.

http.authPasswordField

(Optional, String) When using form-based authentication, the name of the form field that holds the password.

http.authUrl

(Optional, String) The URL to which the form is posted when using form-based authentication.

http.authMethod

(Optional, String) The authentication method. Supported values are basic, form, and digest.
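The authentication parameters above could be combined for form-based authentication roughly as follows. This is a sketch under stated assumptions: the dotted parameter names are taken to mean an http object nested inside the seed, which this page does not confirm, and the login URL, form field names, and credentials are hypothetical placeholders:

```json
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "http": {
      "authMethod": "form",
      "authUrl": "https://www.website.com/login",
      "authUsernameField": "username",
      "authPasswordField": "password",
      "authUserName": "crawler-user",
      "authPassword": "change-me"
    }
  }
}
```

For the basic and digest methods, the form-specific parameters (authUrl, authUsernameField, authPasswordField) would presumably not be needed.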

Known limitations

The minimum schedule window for the Website connector is 5 seconds, which means that at best it can run every 5 seconds. If it is scheduled more frequently, it throws an error because an execution is already running.
