Website Connector
Based on the Norconex HTTP Collector, the Website connector crawls websites to fetch content and to discover links and page structure, beginning from one or more start URLs.
Configuration
Example configuration in a seed:
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "userAgent": "pureinsights-website-connector",
    "metadataFilters": [
      {
        "field": "content-type",
        "values": [
          "text/html; charset=UTF-8"
        ],
        "mode": "exclude"
      }
    ]
  },
  "name": "Website Crawler",
  "id": "5f40c3b9-901c-434d-a816-046464943016",
  "type": "website-connector",
  "pipelineId": "9901132e-5b03-415e-9131-5443282dd0dd",
  "credentialId": "74f4f2e9-1284-4021-ba34-1f7f6b48b85f"
}
See the Credentials section for details on the parameters used for optional authentication.
Configuration parameters:
crawlUrls
(Required, Array of Strings) A list of start URLs that need to be crawled. The URLs can be from different domains. Robots.txt and sitemaps will be used and respected if present.
userAgent
(Optional, String) Identifies requests made to the target websites. It is recommended to set it to something identifiable, so that website owners can recognize traffic from the connector and possibly avoid blocking or throttling it. Defaults to pureinsights-website-connector.
maxDepth
(Optional, Integer) If set, limits how many levels of links are followed from a start URL. Default is unlimited.
maxDocuments
(Optional, Integer) If set, limits the number of references to be processed. A processed reference is one that was read from the crawler queue in an attempt to fetch its corresponding document, whether that attempt was successful or not. Default is unlimited.
ignoreSiteMap
(Optional, Boolean) Whether to skip sitemap detection and resolution for the processed URLs. Default is false.
ignoreCanonicalLinks
(Optional, Boolean) Whether canonical links found in HTTP headers and in the <head> section of HTML files should be ignored (true) or processed (false). Default is false.
skipParsing
(Optional, Boolean) Whether to upload crawled documents to the binary data service as they were downloaded (true), or to parse them and add their text to the records instead (false). If set to true, the id of each record will be a hash of the crawled URL; otherwise the URL itself is used. For a more detailed explanation, see the Parsing of crawled documents section. Defaults to false.
importerConfig
(Optional, String) File server key of a Norconex importer configuration file to load into the crawlers (see Importer Configuration). This configuration is ignored when skipParsing is enabled.
maxCrawlers
(Optional, Integer) Number of crawlers to handle the URLs. Default is 1.
maxThreadsPerCrawler
(Optional, Integer) Number of execution threads for a crawler. Default is 2.
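As an illustration of how the crawl-scope and concurrency parameters above combine, the following seed fragment is a minimal sketch. The numeric values are arbitrary examples rather than recommendations, and the remaining top-level fields (name, type, pipelineId) are assumed to be the same as in the example configuration at the top of this section.

```json
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "maxDepth": 3,
    "maxDocuments": 1000,
    "maxCrawlers": 2,
    "maxThreadsPerCrawler": 4
  }
}
```

With these values, links more than three levels away from the start URL would not be followed, and processing would stop once 1000 references have been read from the crawler queue, whether or not each fetch succeeded.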
metadataFilterField
(Optional, String) The name of the metadata field used to filter the documents. This parameter is deprecated; new metadata filters should be created with the metadataFilters configuration.
metadataFilterValue
(Optional, String) Comma-separated list of metadata values that allow the document to be fetched. This parameter is deprecated; new metadata filters should be created with the metadataFilters configuration.
metadataFilters
An array of filters applied to the processed documents. Each filter is defined by the following fields:
metadataFilters.field
(Optional, String) Name of the metadata field used to filter the documents.
metadataFilters.values
(Optional, String[]) Array of values used to filter the documents.
metadataFilters.mode
(Optional, String) Specifies whether documents matching the given values are excluded from the result or are the only ones included. It can be set to include or exclude. The default is include.
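The following fragment sketches a metadataFilters entry in include mode, as a counterpart to the exclude example at the top of this section. The second value, application/pdf, is only an illustrative choice. Whether values are matched exactly or as patterns is not stated here, so the full header value is written out, as in the earlier example.

```json
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "metadataFilters": [
      {
        "field": "content-type",
        "values": [
          "text/html; charset=UTF-8",
          "application/pdf"
        ],
        "mode": "include"
      }
    ]
  }
}
```

Because include is the default mode, the mode line could presumably be omitted here; it is kept to make the intent explicit.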
useWebDriverFetcher
(Optional, Boolean) If enabled, an instance of the Norconex Web Driver HTTP Fetcher is used to crawl sites instead of the default Generic HTTP Fetcher. The web driver fetcher can be useful for JavaScript-driven sites that the generic fetcher has trouble with, but it comes with many drawbacks and limitations, explained in the Web driver fetcher considerations section. Default is false.
fetcher.pageLoadTimeout
(Optional, Duration) Maximum wait time for a page to load. Default is unlimited.
fetcher.implicitlyWait
(Optional, Duration) Maximum wait time for an element to appear. Default is unlimited.
fetcher.threadWait
(Optional, Duration) Makes the web driver's thread sleep for the specified duration, to give the page enough time to load. This is sometimes necessary for web driver implementations where the options above do not work. Default is unlimited.
fetcher.waitForElement.type
(Required only if any waitForElement config is used, String) Type of element to wait for. The available options are those listed here; any other option will result in an error.
fetcher.waitForElement.selector
(Required only if any waitForElement config is used, String) Reference to the element, as per the type specified.
fetcher.waitForElement.duration
(Required only if any waitForElement config is used, Duration) Maximum wait time for the element to show up in the browser before returning.
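To show how the web driver fetcher parameters might fit together, here is a hedged sketch of a seed that enables the fetcher and waits for a specific element. Several details are assumptions that are not confirmed by this page: that the dotted parameter names map to nested JSON objects (as metadataFilters.field does), that Duration values are written as strings such as "30s", that id is one of the accepted waitForElement types (it is one of the type names used by Norconex's WebDriverHttpFetcher), and the selector main-content, which is a made-up element id.

```json
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "useWebDriverFetcher": true,
    "fetcher": {
      "pageLoadTimeout": "30s",
      "implicitlyWait": "5s",
      "waitForElement": {
        "type": "id",
        "selector": "main-content",
        "duration": "10s"
      }
    }
  }
}
```

All three waitForElement fields appear together because each one is required as soon as any of them is used.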
Working Directory
The component uses MongoDB by default to store the status of the crawl. A filesystem working directory could also be used, but it is deliberately left unset: if the HttpCollectorConfig sets that path, directories are created that are not needed while MongoDB is in use. See the Norconex documentation for more information about the setWorkDir method.
Known limitations
The minimum schedule window for the Website connector is 5 seconds, which means that, at best, it can run every 5 seconds. With a shorter interval, it will throw an error because an execution is already running.
Parsing of crawled documents
By default, the library used for crawling, Norconex, attempts to parse all downloaded files and to extract text information from them, as well as metadata. For the vast majority of file types, Norconex will use an Apache Tika parser and therefore the records put in the pipeline will contain Tika's output. More info about how different file types are handled can be found here.
This default parsing behaviour can be disabled with the optional skipParsing configuration value. If it is set to true, the connector neither parses crawled files nor places their contents in the records added to the pipeline. The files are instead saved in the binary data service (BDS) as they were downloaded. These uploaded files can be accessed in subsequent pipeline stages through the binaryContent field added to each record, which contains the key to retrieve the original file from the BDS.
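A minimal sketch of a seed that turns parsing off is shown below; only skipParsing differs from a default crawl, and the other top-level fields are assumed to match the example configuration at the top of this page.

```json
{
  "seed": {
    "crawlUrls": [
      "https://www.website.com"
    ],
    "skipParsing": true
  }
}
```

With this setting, each record's id would be a hash of the crawled URL rather than the URL itself, the downloaded file would be reachable through the record's binaryContent key, and any importerConfig value would be ignored.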
Credentials
The credential is ignored if the web driver HTTP fetcher is used; see the Web driver fetcher considerations section for details.
If required, the optional credentialId property can be used to reference a target Credential entity, which may have the following configuration parameters:
method
(Required, String) Method to be used to fill the login. Currently only "form" is supported.
username
(Required, String) Username to be used for login.
password
(Required, String) Password to be used for login.
formUsernameField
(Required, String) Name of the target field where the username will be placed.
formPasswordField
(Required, String) Name of the target field where the password will be placed.
loginUrl
(Required, String) URL where the login will be executed.
Example configuration:
{
  "name": "Website Connector Credentials",
  "type": "website-connector",
  "config": {
    "method": "form",
    "username": "username",
    "password": "password",
    "formUsernameField": "login_username",
    "formPasswordField": "login_password",
    "loginUrl": "https://localhost:8000/loginForm"
  }
}
Web driver fetcher considerations
The web driver fetcher is intended to be used only if a generic http fetcher is unable to crawl a given site. As it relies on an external Selenium WebDriver, the process to fetch pages is usually slower and not as scalable, and may be less stable. In addition to that, the following limitations are to be taken into consideration:
- Credentials are ignored when using this fetcher.
- The fetcher only supports the GET HTTP method.
- Most sites' sitemaps can't be read when using this fetcher.
- Metadata filters may not work as intended when using this fetcher. In general, filters that do not rely on HTTP headers should still work, such as Content-Type or document_contentFamily.
©2024 Pureinsights Technology Corporation