...

```json
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "userAgent": "pureinsights-website-connector",
    "metadataFilters": [
      {
        "field": "content-type",
        "values": [
          "text/html; charset=UTF-8"
        ],
        "mode": "exclude"
      }
    ]
  },
  "name": "Website Crawler",
  "id": "5f40c3b9-901c-434d-a816-046464943016",
  "type": "website-connector",
  "pipelineId": "9901132e-5b03-415e-9131-5443282dd0dd",
  "timestamp": "2021-09-24T13:33:05.941392800Z"
}
```

...

(Optional, String) The name of the metadata field used to filter the documents. This parameter is deprecated; new metadata filters should be created with the `metadataFilters` configuration.

`metadataFilterValue`

(Optional, String) Comma-separated list of metadata values that allow the document to be fetched. This parameter is deprecated; new metadata filters should be created with the `metadataFilters` configuration.

`metadataFilters`

An array of filters to be set on the processed documents. Each filter must contain the following fields (a combined example follows the field descriptions):

`metadataFilters.field`

(Optional, String) Name of the metadata field used to filter the documents.

`metadataFilters.values`

(Optional, String[]) Array of values used to filter the documents.

`metadataFilters.mode`

(Optional, String) Specifies whether documents matching the given values are excluded from the result or are the only ones included. It can be set to `include` or `exclude`; the default is `include`.
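
To illustrate how these three fields combine, here is a minimal sketch of a filter that keeps only HTML documents; it mirrors the configuration example at the top of the page (the `content-type` value is illustrative), with `mode` flipped to `include`:

```json
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "metadataFilters": [
      {
        "field": "content-type",
        "values": [
          "text/html; charset=UTF-8"
        ],
        "mode": "include"
      }
    ]
  }
}
```

With `mode` set to `include`, only documents whose `content-type` matches one of the listed values are kept; with `exclude`, those documents are dropped instead.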


`http.authUserName`

(Optional, String) The username to be used for authentication.

`http.authPassword`

(Optional, String) The password to be used for authentication.

`http.authUsernameField`

(Optional, String) When using form-based authentication, this is the field in the form that holds the username.

`http.authPasswordField`

(Optional, String) When using form-based authentication, this is the field in the form that holds the password.

`http.authUrl`

(Optional, String) The URL to which the form is posted when using form-based authentication.

`http.authMethod`

(Optional, String) The authentication method. Supported values are `basic`, `form`, and `digest`.
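
For example, a form-based login could be configured by combining the fields above roughly as follows. This is a minimal sketch: the login URL and the form field names (`username`, `password`) are assumptions about the target site, and placing the dotted `http.*` keys alongside the other `seed` options mirrors the configuration example above but may differ in practice:

```json
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "http.authMethod": "form",
    "http.authUrl": "https://www.website.com/login",
    "http.authUsernameField": "username",
    "http.authPasswordField": "password",
    "http.authUserName": "crawler-user",
    "http.authPassword": "changeme"
  }
}
```

For `basic` or `digest` authentication, only `http.authMethod`, `http.authUserName`, and `http.authPassword` should be needed, since the remaining fields apply to form-based authentication.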

# Working Directory
The component uses MongoDB by default to store the status of the crawl.
A filesystem working directory could also be used, but it is deliberately left
unset: if that path is set on the `HttpCollectorConfig`, unneeded directories
are created, and since MongoDB is being used they serve no purpose.
See the [Norconex documentation](https://opensource.norconex.com/crawlers/web/v3/apidocs/index.html?com/norconex/collector/http/HttpCollectorConfig.html)
for more information about the `setWorkDir` method.

# Known limitations

The minimum schedule window for the Website connector is 5 seconds, i.e. at best it can run once every 5
seconds. If it is scheduled more frequently, it throws an error because an execution is already running.

# Parsing of crawled documents

By default, Norconex, the library used for crawling, attempts to parse all downloaded files and to extract text
and metadata from them. For the vast majority of file types, Norconex uses
an [Apache Tika](https://tika.apache.org/) parser, so the records put in the pipeline will contain Tika's
output. More information about how different file types are handled can be
found [here](https://opensource.norconex.com/importer/v3/formats).

This default parsing behaviour can be disabled with the optional `skipParsing` configuration value. If it is set
to `true`, the connector neither parses crawled files nor places their contents in the records added to the
pipeline. Instead, the files are saved to the binary data server (BDS) exactly as they were downloaded. These
uploaded files can be accessed in subsequent pipeline stages through the `binaryContent` field added to each record,
which contains the key to retrieve the original file from the BDS.
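
As a sketch of how this might look (assuming `skipParsing` sits alongside the other `seed` options, like the parameters above):

```json
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "skipParsing": true
  }
}
```

Downstream pipeline stages would then read the `binaryContent` key from each record and use it to fetch the untouched file from the BDS.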