...
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "userAgent": "pureinsights-website-connector",
    "metadataFilters": [
      {
        "field": "content-type",
        "values": [
          "text/html; charset=UTF-8"
        ],
        "mode": "exclude"
      }
    ]
  },
  "name": "Website Crawler",
  "id": "5f40c3b9-901c-434d-a816-046464943016",
  "type": "website-connector",
  "pipelineId": "9901132e-5b03-415e-9131-5443282dd0dd",
  "credentialId": "74f4f2e9-1284-4021-ba34-1f7f6b48b85f",
  "timestamp": "2021-09-24T13:33:05.941392800Z"
}
See the Credentials section for details on the parameters of the optional authentication.
Configuration parameters:
...
(Optional, Int) If set, limits the number of references to follow from a page, start to finish. Default is unlimited.
maxDocuments
(Optional, Int) If set, limits the number of references to be processed. A processed reference is one that was read from the crawler queue in an attempt to fetch its corresponding document, whether or not that attempt was successful. Default is unlimited.
ignoreSiteMap
...
(Optional, Boolean) Sets whether to upload crawled documents to the binary data service, or to parse them and add their text information to the records instead. If set to true, the id of the records will be a hash of the crawled URL; otherwise, the URL itself will be used. For a more detailed explanation, see the Parsing of crawled documents section. Defaults to false.
importerConfig
(Optional, String) File server file key pointing to a Norconex importer configuration file to be loaded into the crawlers (see Importer Configuration). This configuration is ignored when skipParsing is enabled.
maxCrawlers
(Optional, Integer) Number of crawlers to handle the URLs. Default is 1.
...
(Optional, String) Value specifying whether the given values should be excluded from the result or should be the only ones included. It can be set to include or exclude; the default is include.
...
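For reference, below is a sketch of a seed configuration that combines several of the optional parameters described above. It is illustrative only: the values are arbitrary, the importer configuration file key is a hypothetical placeholder, and the placement of these parameters alongside crawlUrls inside the seed object is assumed from the example at the top of this page rather than taken verbatim from it.
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "userAgent": "pureinsights-website-connector",
    "maxDocuments": 5000,
    "maxCrawlers": 2,
    "skipParsing": false,
    "importerConfig": "importer-config-file-key",
    "metadataFilters": [
      {
        "field": "content-type",
        "values": [
          "text/html; charset=UTF-8"
        ],
        "mode": "include"
      }
    ]
  },
  "name": "Website Crawler",
  "type": "website-connector",
  "pipelineId": "9901132e-5b03-415e-9131-5443282dd0dd"
}
In this sketch, maxDocuments caps the crawl at 5000 processed references, two crawlers share the work, and the include mode keeps only pages whose content-type matches text/html; charset=UTF-8.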
Working Directory
The component uses MongoDB by default to store the status of the crawl. It is also possible to use a filesystem working directory; however, it is not set, because if the HttpCollectorConfig sets that path, unneeded directories will be created, and since MongoDB is being used they are not needed.
See the Norconex documentation for more information about the setWorkDir method.
Known limitations
The minimum schedule window for the Website connector is 5 seconds. This means that, at best, it can run every 5 seconds. Otherwise, it will throw an error because there is an execution already running.
Parsing of crawled documents
By default, the library used for crawling, Norconex, attempts to parse all downloaded files and to extract text information from them, as well as metadata. For the vast majority of file types, Norconex will use an Apache Tika parser, and therefore the records put in the pipeline will contain Tika's output.
More information about how different file types are handled can be found in the Norconex documentation.
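To make the default behaviour concrete, the sketch below shows roughly what a record produced from a parsed HTML page could look like. Only two details are documented on this page: the record id is the crawled URL itself (since skipParsing defaults to false), and the record carries the text and metadata extracted by the parser. The field names and values shown are assumptions for illustration, not the connector's actual record schema.
{
  "id": "https://www.website.com/about",
  "content": "Plain text extracted from the page by the parser...",
  "metadata": {
    "content-type": "text/html; charset=UTF-8",
    "title": "About us"
  }
}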
This default parsing behaviour can be disabled with the optional skipParsing configuration value. If that value is set to true, the connector will neither parse the crawled files nor place their contents in the records added to the pipeline. Instead, the files will be saved in the binary data service exactly as they were downloaded. These uploaded files can be accessed in subsequent pipeline stages through the binaryContent field added to each record, which contains the key needed to retrieve the original file from the BDS.
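By contrast, when skipParsing is enabled a record could look roughly like the sketch below. The documented parts are that the id is a hash of the crawled URL and that the binaryContent field holds the key for retrieving the stored file; the actual hash and key formats shown here are hypothetical.
{
  "id": "3f8e1c2a9b7d4e6f1a2b3c4d5e6f7a8b",
  "binaryContent": "3f8e1c2a9b7d4e6f1a2b3c4d5e6f7a8b"
}
Downstream pipeline stages that need the raw bytes would use the binaryContent key to fetch the original file from the binary data service rather than reading text from the record.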
Credentials
If required, the optional "credentialId" property can be used to reference a target Credential entity, which has the following configuration parameters:
method
(Required, String) Method used to perform the login. Currently only "form" is supported.
username
(Required, String) Username to be used for login.
password
(Required, String) Password to be used for login.
formUsernameField
(Required, String) Name of the target field where the username will be placed.
formPasswordField
(Required, String) Name of the target field where the password will be placed.
loginUrl
(Required, String) URL where the login will be executed.
Example configuration:
{
  "name": "Website Connector Credentials",
  "type": "website-connector",
  "config": {
    "method": "form",
    "username": "username",
    "password": "password",
    "formUsernameField": "login_username",
    "formPasswordField": "login_password",
    "loginUrl": "https://localhost:8000/loginForm"
  }
}
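Once such a credential entity has been created, the connector references it through the optional credentialId property shown in the configuration example at the top of this page. The sketch below simply illustrates that linkage, assuming the credential above was stored with the id used in that first example.
{
  "name": "Website Crawler",
  "type": "website-connector",
  "credentialId": "74f4f2e9-1284-4021-ba34-1f7f6b48b85f",
  "pipelineId": "9901132e-5b03-415e-9131-5443282dd0dd",
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ]
  }
}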