...
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "userAgent": "pureinsights-website-connector",
    "metadataFilters": [
      {
        "field": "content-type",
        "values": [
          "text/html; charset=UTF-8"
        ],
        "mode": "exclude"
      }
    ]
  },
  "name": "Website Crawler",
  "id": "5f40c3b9-901c-434d-a816-046464943016",
  "type": "website-connector",
  "pipelineId": "9901132e-5b03-415e-9131-5443282dd0dd",
  "credentialId": "74f4f2e9-1284-4021-ba34-1f7f6b48b85f",
  "timestamp": "2021-09-24T13:33:05.941392800Z"
}
See the Credentials section for details on the parameters of the optional authentication.
Configuration parameters:
...
(Optional, Int) If set, limits the number of references to follow from a page, start to finish. Default is unlimited.
maxDocuments
(Optional, Int) If set, limits the number of references to be processed. A processed reference is one that was read from the crawler queue in an attempt to fetch its corresponding document, whether or not that attempt was successful. Default is unlimited.
ignoreSiteMap
...
(Optional, Boolean) Sets whether to upload crawled documents to the binary data service, or to parse them and add their text information to the records instead. If set to true, the id of the records will be a hash of the crawled URL; otherwise, the URL itself will be used. For a more detailed explanation, see the Parsing of crawled documents section. Defaults to false.
importerConfig
(Optional, String) File server file key pointing to a Norconex importer configuration file to be loaded into the crawlers (see Importer Configuration). This configuration is ignored when skipParsing is enabled.
maxCrawlers
(Optional, Integer) Number of crawlers to handle the URLs. Default is 1.
...
(Optional, String) Value specifying whether the given values should be excluded from the result or should be the only ones included. It can be set to include or exclude; the default is include.
...
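For reference, below is a sketch of a seed configuration that combines several of the optional parameters described above. It is illustrative only: the values are arbitrary, the importer configuration file key is a hypothetical placeholder, and the placement of these parameters alongside crawlUrls inside the seed object is assumed from the example at the top of this page rather than taken verbatim from it.
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "userAgent": "pureinsights-website-connector",
    "maxDocuments": 5000,
    "maxCrawlers": 2,
    "skipParsing": false,
    "importerConfig": "importer-config-file-key",
    "metadataFilters": [
      {
        "field": "content-type",
        "values": [
          "text/html; charset=UTF-8"
        ],
        "mode": "include"
      }
    ]
  },
  "name": "Website Crawler",
  "type": "website-connector",
  "pipelineId": "9901132e-5b03-415e-9131-5443282dd0dd"
}
In this sketch, maxDocuments caps the crawl at 5000 processed references, two crawlers share the work, and the include mode keeps only pages whose content-type matches text/html; charset=UTF-8.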
Working Directory
The component uses MongoDB by default to store the status of the crawl. It is also possible to use a filesystem working directory; however, it is not set, because if the HttpCollectorConfig sets that path, unneeded directories will be created, and since MongoDB is being used they are not needed.
See the Norconex documentation for more information about the setWorkDir method.
Known limitations
The minimum schedule window for the Website connector is 5 seconds. This means that, at best, it can run every 5 seconds. Otherwise, it will throw an error because there is an execution already running.
Parsing of crawled documents
By default, the library used for crawling, Norconex, attempts to parse all downloaded files and to extract text information from them, as well as metadata. For the vast majority of file types, Norconex will use an Apache Tika parser, and therefore the records put in the pipeline will contain Tika's output.
More information about how different file types are handled can be found in the Norconex documentation.
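To make the default behaviour concrete, the sketch below shows roughly what a record produced from a parsed HTML page could look like. Only two details are documented on this page: the record id is the crawled URL itself (since skipParsing defaults to false), and the record carries the text and metadata extracted by the parser. The field names and values shown are assumptions for illustration, not the connector's actual record schema.
{
  "id": "https://www.website.com/about",
  "content": "Plain text extracted from the page by the parser...",
  "metadata": {
    "content-type": "text/html; charset=UTF-8",
    "title": "About us"
  }
}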
This default parsing behaviour can be disabled with the optional skipParsing configuration value. If that value is set to true, the connector will neither parse the crawled files nor place their contents in the records added to the pipeline. Instead, the files will be saved in the binary data service exactly as they were downloaded. These uploaded files can be accessed in subsequent pipeline stages through the binaryContent field added to each record, which contains the key needed to retrieve the original file from the BDS.
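By contrast, when skipParsing is enabled a record could look roughly like the sketch below. The documented parts are that the id is a hash of the crawled URL and that the binaryContent field holds the key for retrieving the stored file; the actual hash and key formats shown here are hypothetical.
{
  "id": "3f8e1c2a9b7d4e6f1a2b3c4d5e6f7a8b",
  "binaryContent": "3f8e1c2a9b7d4e6f1a2b3c4d5e6f7a8b"
}
Downstream pipeline stages that need the raw bytes would use the binaryContent key to fetch the original file from the binary data service rather than reading text from the record.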
Credentials
If required, the optional "credentialId" property can be used to reference a target Credential entity, which has the following configuration parameters:
method
(Required, String) Method used to perform the login. Currently only "form" is supported.
username
(Required, String) Username to be used for login.
password
(Required, String) Password to be used for login.
formUsernameField
(Required, String) Name of the target field where the username will be placed.
formPasswordField
(Required, String) Name of the target field where the password will be placed.
loginUrl
(Required, String) URL where the login will be executed.
Example configuration:
{
  "name": "Website Connector Credentials",
  "type": "website-connector",
  "config": {
    "method": "form",
    "username": "username",
    "password": "password",
    "formUsernameField": "login_username",
    "formPasswordField": "login_password",
    "loginUrl": "https://localhost:8000/loginForm"
  }
}
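Once such a credential entity has been created, the connector references it through the optional credentialId property shown in the configuration example at the top of this page. The sketch below simply illustrates that linkage, assuming the credential above was stored with the id used in that first example.
{
  "name": "Website Crawler",
  "type": "website-connector",
  "credentialId": "74f4f2e9-1284-4021-ba34-1f7f6b48b85f",
  "pipelineId": "9901132e-5b03-415e-9131-5443282dd0dd",
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ]
  }
}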