Based on the Norconex HTTP Collector, the Website connector crawls websites to fetch content and discover links and page structure starting from a set of seed URLs.
Configuration
Example configuration in a seed:
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "userAgent": "pureinsights-website-connector"
  },
  "name": "Website Crawler",
  "id": "5f40c3b9-901c-434d-a816-046464943016",
  "type": "website-connector",
  "pipelineId": "9901132e-5b03-415e-9131-5443282dd0dd",
  "timestamp": "2021-09-24T13:33:05.941392800Z"
}
Configuration parameters:
crawlUrls
(Required, Array of Strings) A list of start URLs to crawl. The URLs can be from different domains. Robots.txt and sitemaps will be used and respected if present.
userAgent
(Optional, String) Used to identify requests against the target websites. It is recommended to set it to something identifiable, so that website owners can recognize traffic from the connector and possibly avoid it being blocked or throttled. Defaults to pureinsights-website-connector.
maxDepth
(Optional, Integer) If set, limits how many levels of references are followed from a start URL. Default is unlimited.
ignoreSiteMap
(Optional, Boolean) Sets whether to ignore sitemap detection and resolution for the URLs processed. Default is false.
skipParsing
(Optional, Boolean) Sets whether to upload crawled documents to the binary data service instead of parsing them and adding their text to the records. If set to true, the id of each record will be a hash of the crawled URL; otherwise the URL itself will be used. For a more detailed explanation, see the Parsing of crawled documents section. Defaults to false.
maxCrawlers
(Optional, Integer) Number of crawlers to handle the URLs. Default is 1.
maxThreadsPerCrawler
(Optional, Integer) Number of execution threads for a crawler. Default is 2.
metadataFilterField
(Optional, String) The name of the metadata field used to filter documents.
metadataFilterValue
(Optional, String) Comma-separated list of metadata values that allow a document to be fetched.
http.authUserName
(Optional, String) The username to be used for authentication.
http.authPassword
(Optional, String) The password to be used for authentication.
http.authUsernameField
(Optional, String) When using form-based authentication, this is the field in the form that holds the username.
http.authPasswordField
(Optional, String) When using form-based authentication, this is the field in the form that holds the password.
http.authUrl
(Optional, String) The URL to which the form is posted when using form-based authentication.
http.authMethod
(Optional, String) The authentication method. Supported values are basic, form, and digest. See the example after this list for a form-based configuration.
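The optional parameters above can be combined in the same seed object as crawlUrls. The sketch below is illustrative only: the URL, credentials, and metadata field name are placeholders, and it assumes the optional parameters are set as flat keys inside the seed object, in the same way userAgent is in the example above.
{
  "seed": {
    "crawlUrls": [
      "https://www.example.com"
    ],
    "userAgent": "pureinsights-website-connector",
    "maxDepth": 3,
    "ignoreSiteMap": false,
    "maxCrawlers": 2,
    "maxThreadsPerCrawler": 4,
    "metadataFilterField": "content-type",
    "metadataFilterValue": "text/html,application/pdf",
    "http.authMethod": "form",
    "http.authUserName": "crawler-user",
    "http.authPassword": "crawler-password",
    "http.authUsernameField": "username",
    "http.authPasswordField": "password",
    "http.authUrl": "https://www.example.com/login"
  },
  "name": "Authenticated Website Crawler",
  "type": "website-connector",
  "pipelineId": "9901132e-5b03-415e-9131-5443282dd0dd"
}
With http.authMethod set to form, the connector posts the configured username and password to http.authUrl, using http.authUsernameField and http.authPasswordField as the form field names.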
Working Directory
The component uses MongoDB by default to store the status of the crawl. A filesystem working directory could also be used, but it is deliberately not set: if that path were set on the HttpCollectorConfig, unneeded directories would be created, and since MongoDB already holds the crawl state they are not needed. See the Norconex documentation for more information about the setWorkDir method.
Known limitations
The minimum schedule window for the Website connector is 5 seconds; at best, it can run every 5 seconds. If it is scheduled more frequently, it will throw an error because an execution is already running.
Parsing of crawled documents
By default, the library used for crawling, Norconex, attempts to parse all downloaded files and to extract text information from them, as well as metadata. For the vast majority of file types, Norconex will use an Apache Tika parser and therefore the records put in the pipeline will contain Tika's output. More info about how different file types are handled can be found here.
This default parsing behaviour can be disabled with the optional skipParsing configuration value. If it is set to true, the connector will neither parse crawled files nor place their contents in the records added to the pipeline. Instead, the files are saved in the binary data service (BDS) as they were downloaded. These uploaded files can be accessed in subsequent pipeline stages through the binaryContent field added to each record, which contains the key to retrieve the original file from the BDS.
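As an illustration, a seed that skips parsing could look like the sketch below; the URL and the connector name are placeholders.
{
  "seed": {
    "crawlUrls": [
      "https://www.example.com/docs"
    ],
    "skipParsing": true
  },
  "name": "Binary Website Crawler",
  "type": "website-connector",
  "pipelineId": "9901132e-5b03-415e-9131-5443282dd0dd"
}
A record produced by such a crawl would then carry an id derived from a hash of the crawled URL and a binaryContent field holding the retrieval key, roughly along these lines (the exact record layout is an assumption, shown only to indicate where the key appears):
{
  "id": "<hash of the crawled URL>",
  "binaryContent": "<key used to retrieve the original file from the BDS>"
}
A later pipeline stage can use that key to fetch the original file from the binary data service.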