...

{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "userAgent": "pureinsights-website-connector",
    "metadataFilters": [
      {
        "field": "content-type",
        "values": [
          "text/html; charset=UTF-8"
        ],
        "mode": "exclude"
      }
    ]
  },
  "name": "Website Crawler",
  "id": "5f40c3b9-901c-434d-a816-046464943016",
  "type": "website-connector",
  "pipelineId": "9901132e-5b03-415e-9131-5443282dd0dd",
  "credentialId": "74f4f2e9-1284-4021-ba34-1f7f6b48b85f",
  "timestamp":
"2021-09-24T13:33:05.941392800Z"
}

See the Credentials section for details on the parameters for the optional authentication.

...

(Optional, Boolean) Sets whether to ignore sitemap detection and resolution for the URLs processed. Default is false.

ignoreCanonicalLinks

(Optional, Boolean) Sets whether canonical links found in HTTP headers and in the <head> section of HTML files should be ignored or processed. Default is false.
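
For instance, a seed configuration that disables both sitemap resolution and canonical link processing could look like the sketch below. The ignoreSitemap flag name and the placement of both flags alongside crawlUrls in the seed object are inferred from the parameter descriptions and the example above, so treat this as illustrative rather than authoritative:

{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "ignoreSitemap": true,
    "ignoreCanonicalLinks": true
  }
}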

...

(Optional, String) Value specifying whether the given values are to be excluded from the result or should be the only ones included. It can be set to include or exclude; the default is include.
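
For example, to keep only HTML pages and drop everything else, the filter from the seed example above can be switched to include mode (a sketch; the exact content-type values depend on the site being crawled):

{
  "metadataFilters": [
    {
      "field": "content-type",
      "values": [
        "text/html; charset=UTF-8"
      ],
      "mode": "include"
    }
  ]
}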

useWebDriverFetcher

(Optional, Boolean) If enabled, an instance of a Norconex Web Driver HTTP Fetcher will be used to crawl sites instead of the default Generic HTTP Fetcher. The web driver fetcher can be useful for fetching JavaScript-driven sites that the generic fetcher has trouble with, but it comes with many drawbacks and limitations, explained here. Default is false.

fetcher.pageLoadTimeout

(Optional, Duration) Max wait time for a page to load. Default is unlimited.

fetcher.implicitlyWait

(Optional, Duration) Max wait time for an element to appear. Default is unlimited.

fetcher.threadWait

(Optional, Duration) Makes the web driver's thread sleep for the specified duration, giving it enough time to load the page. This is sometimes necessary for some web driver implementations when the options above do not work. Default is unlimited.

fetcher.waitForElement.type

(Required only if any waitForElement config is used, String) Type of element to wait for. The options available are those listed here. Any other option will result in an error.

fetcher.waitForElement.selector

(Required only if any waitForElement config is used, String) Reference to element, as per the type specified.

fetcher.waitForElement.duration

(Required only if any waitForElement config is used, Duration) Max wait time for an element to show up in the browser before returning.
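
Putting these options together, a seed configuration that enables the web driver fetcher might look like the sketch below. The dotted key form, the duration strings, and the ID element type are assumptions based on the parameter names above, not confirmed syntax:

{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "useWebDriverFetcher": true,
    "fetcher.pageLoadTimeout": "30s",
    "fetcher.implicitlyWait": "5s",
    "fetcher.waitForElement.type": "ID",
    "fetcher.waitForElement.selector": "main-content",
    "fetcher.waitForElement.duration": "10s"
  }
}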

Working Directory

The component uses MongoDB by default to store the status of the crawl. It is also possible to use a filesystem working directory; however, it is not set because if the HttpCollectorConfig sets that path, unneeded directories will be created, and since MongoDB is being used they are not needed. See the Norconex documentation for more information about the setWorkDir method.

...

This default parsing behaviour can be disabled with the optional skipParsing configuration value. If that value is set to true, the connector will neither parse crawled files nor place their contents in the records added to the pipeline. Instead, the files will be saved in the binary data server exactly as they were downloaded. These uploaded files can be accessed in subsequent pipeline stages through the binaryContent field added to each record, which contains the key for retrieving the original file from the BDS.
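
For example, a seed configuration that uploads raw files to the binary data server instead of parsing them could look like this sketch (the placement of skipParsing inside the seed object is assumed from the example above):

{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "skipParsing": true
  }
}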

Credentials

The credential will be ignored if the web driver HTTP fetcher is used. More details here.

If required, the optional "credentialId" property can be used to reference a target Credential entity, which may have the following configuration parameters:

...

{
  "name": "Website Connector Credentials",
  "type": "website-connector",
  "config": {
    "method": "form",
    "username": "username",
    "password": "password",
    "formUsernameField": "login_username",
    "formPasswordField": "login_password",
    "loginUrl": "https://localhost:8000/loginForm"
  }
}

Web driver fetcher considerations

The web driver fetcher is intended to be used only if the generic HTTP fetcher is unable to crawl a given site. Because it relies on an external Selenium WebDriver, fetching pages is usually slower, less scalable, and potentially less stable. In addition, the following limitations should be taken into consideration:

  • Credentials are ignored when using this fetcher
  • The fetcher only supports the GET HTTP method
  • Most sites' sitemaps can't be read when using this fetcher
  • Metadata filters may not work as intended when using this fetcher. In general, filters on fields that are populated without relying on HTTP headers, such as Content-Type or document_contentFamily, should still work.