HTML Processor
This processor parses the content of an HTML file and, with the action 'process', extracts a subset of its sections by CSS selectors.
API
Jsoup. A Java library for working with real-world HTML. Version 1.14.3
Action: Process
Extracts HTML content from a field by using CSS selectors.
Configuration
Sample configuration in a processor:
{
  "name": "html processor",
  "type": "html-processor",
  "html": {
    "key": "file",
    "url": "url",
    "output": "outField",
    "select": [
      ".cl1"
    ]
  }
}
Configuration parameters:
key
- Required, string
Name of the record's field containing the HTML file content.
url
- Optional, string, default is null.
Name of the record's field containing the URL the HTML page was downloaded from. Jsoup uses it only to resolve relative links, so it is not required.
output
- Required, string
Field name where output data will be placed.
select
- Optional, string | list
A CSS selector, or a list of CSS selectors, to extract from the HTML document.
Any element matching one of the selectors is extracted into the output.
For example, the selector .cl1 matches <div class="cl1">. For more information on selectors, see the official Jsoup documentation.
multipleOutputSelect
- Optional, Array
multipleOutputSelect.selector
- Required, string
The CSS selector that identifies the elements to match. Any element matching the selector is extracted into the corresponding output field. If no element matches the selector, the field is not added to the output.
multipleOutputSelect.output
- Required, string
Name of the output field for the returned values.
multipleOutputSelect.toArray
- Optional, boolean
If true, the matches for the selector are returned as a list with one entry per match. If false, multiple matches are joined into a single string using multipleOutputSeparator. Defaults to false.
multipleOutputSeparator
- Optional, String
When using multipleOutputSelect, this defines the string placed between multiple matches of a selector.
If not set, the matches are appended one after the other.
autoEncoding
- Optional, Boolean
If true, the Jsoup parser tries to detect the document encoding from its metadata. If false, it uses the default UTF-8 encoding.
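The interaction between toArray and multipleOutputSeparator can be pictured with a short Python sketch. It is illustrative only: the processor itself runs on Jsoup in Java, and the helper name combine_matches and its arguments are hypothetical, not part of the processor's API.

```python
from typing import List, Optional, Union


def combine_matches(matches: List[str],
                    to_array: bool = False,
                    separator: Optional[str] = None) -> Optional[Union[str, List[str]]]:
    """Combine the text of all elements matched by one selector.

    Hypothetical helper mirroring the documented rules:
    - no matches: the output field is not produced (None here)
    - to_array=True: one list entry per match
    - to_array=False: matches joined with the separator, or simply
      appended one after the other when no separator is set
    """
    if not matches:
        return None
    if to_array:
        return list(matches)
    return (separator or "").join(matches)
```

Under these assumptions, combine_matches(["a", "b"], separator=", ") gives "a, b", while combine_matches(["a", "b"], to_array=True) gives ["a", "b"].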
Input/Output examples
Multiple Output Select Example
Configuration
{
  "name": "html processor",
  "type": "html-processor",
  "html": {
    "key": "file",
    "url": "url",
    "output": "outField",
    "multipleOutputSelect": [
      {
        "selector": "h1#title",
        "output": "title",
        "toArray": false
      },
      {
        "selector": "div#mainContent",
        "output": "body",
        "toArray": true
      }
    ],
    "onlyText": true
  }
}
Input
Using the sample configuration:
<html>
  <h1 id="title"> This is a title</h1>
  <div id="mainContent"> This is the content</div>
</html>
Output
The output is a map from each configured output name to its matching content:
{
  "outField": {
    "title": "This is a title",
    "body": ["This is the content"]
  }
}
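To make the mapping from configuration to output concrete, here is a minimal Python sketch that reproduces this example with the standard library. It is not the processor's implementation (which uses Jsoup): the toy matcher only understands "tag#id" selectors, the input must be well-formed markup, and the names select_text and process are hypothetical. It also assumes text-only, whitespace-trimmed output, as the expected result above suggests.

```python
import xml.etree.ElementTree as ET

HTML = """<html>
<h1 id="title"> This is a title</h1>
<div id="mainContent"> This is the content</div>
</html>"""

# Same rules as the sample configuration above.
RULES = [
    {"selector": "h1#title", "output": "title", "toArray": False},
    {"selector": "div#mainContent", "output": "body", "toArray": True},
]


def select_text(root, selector):
    """Return the trimmed text of every element matching a "tag#id" selector."""
    tag, element_id = selector.split("#")
    return ["".join(el.itertext()).strip()
            for el in root.iter(tag) if el.get("id") == element_id]


def process(html, rules, output_field="outField"):
    """Build the output map: one key per rule that matched something."""
    root = ET.fromstring(html)
    result = {}
    for rule in rules:
        texts = select_text(root, rule["selector"])
        if not texts:
            continue  # no match: the field is left out of the output
        result[rule["output"]] = texts if rule["toArray"] else "".join(texts)
    return {output_field: result}


print(process(HTML, RULES))
# {'outField': {'title': 'This is a title', 'body': ['This is the content']}}
```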
Action: Extract
Configuration
Sample configuration in a processor:
{
  "html": {
    "key": "file",
    "url": "url",
    "output": "outField",
    "titles": [
      "h0", "h1", "h2", "h3", "h4", "h5", "h6", "title", "caption"
    ]
  },
  "name": "html processor",
  "type": "html-processor"
}
Configuration parameters:
key
- Required, string
Name of the record's field containing the HTML file content.
url
- Optional, string, default is null.
Name of the record's field containing the URL the HTML page was downloaded from. Jsoup uses it only to resolve relative links, so it is not required.
output
- Required, string
Field name where output data will be placed.
titles
- Optional, string list
List of tags that are treated as headers for tables and description lists. A tag counts as a header if it is the first child of the table or description list, or if it is that element's previous sibling. For example:
<table>
  <caption>Table header</caption> -> First child
  <tr>
    <th>Header1</th>
  </tr>
  <tr>
    <td>Text1</td>
  </tr>
</table>
<h1>Table header 1</h1> -> Previous sibling
<table>
  <tr>
    <th>Header1</th>
  </tr>
  <tr>
    <td>Text1</td>
  </tr>
</table>
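The header rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the processor's code: find_header is a hypothetical helper, the input must be well-formed markup, and checking the first child before the previous sibling is a choice made for this sketch, since the precedence is not documented here.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

TITLES = ["h0", "h1", "h2", "h3", "h4", "h5", "h6", "title", "caption"]


def find_header(parent: ET.Element, index: int, titles: List[str]) -> Optional[str]:
    """Return the header text for the child of `parent` at `index`, if any.

    Looks at the element's first child, then at its previous sibling.
    The order of the two checks is an assumption of this sketch.
    """
    element = parent[index]
    if len(element) > 0 and element[0].tag in titles:
        return "".join(element[0].itertext()).strip()
    if index > 0 and parent[index - 1].tag in titles:
        return "".join(parent[index - 1].itertext()).strip()
    return None


doc = ET.fromstring(
    "<body>"
    "<table><caption>Table header</caption><tr><th>Header1</th></tr></table>"
    "<h1>Table header 1</h1>"
    "<table><tr><th>Header1</th></tr></table>"
    "</body>"
)
print(find_header(doc, 0, TITLES))  # Table header
print(find_header(doc, 2, TITLES))  # Table header 1
```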
Input/Output examples
Extract Examples
Input
Using the sample configuration:
<html>
  <body>
    <h1>Description list header</h1>
    <dl>
      <dt>Title1</dt>
      <dd>Description1</dd>
      <dt>Title2</dt>
      <dd>Description2</dd>
    </dl>
    <table>
      <caption>Table header</caption>
      <tr>
        <th>Header1</th>
        <th>Header2</th>
      </tr>
      <tr>
        <td>Text1</td>
        <td>Text2</td>
      </tr>
    </table>
  </body>
</html>
This input would look as follows on the record:
{
  "file": "<html>\n<body>\n<h1>Description list header</h1>\n<dl>\n<dt>Title1</dt>......"
}
Output
The output is a list of every table and description list found in the HTML, with the elements grouped by rows. Every element has a type that describes its role in the HTML, along with its corresponding text:
{
  "outField": [
    {
      "header": "Description list header",
      "description_list": [
        [
          { "type": "title", "text": "Title1" },
          { "type": "description", "text": "Description1" }
        ],
        [
          { "type": "title", "text": "Title2" },
          { "type": "description", "text": "Description2" }
        ]
      ]
    },
    {
      "header": "Table header",
      "table": [
        [
          { "type": "header", "text": "Header1" },
          { "type": "header", "text": "Header2" }
        ],
        [
          { "type": "value", "text": "Text1" },
          { "type": "value", "text": "Text2" }
        ]
      ]
    }
  ]
}
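The row grouping shown in this output can be approximated with a short Python sketch. It is a simplified illustration, not the processor's implementation: the helper names are hypothetical, the tag-to-type mapping is inferred from the sample output, nested tables are not handled, and the input must be well-formed markup.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from tag name to the "type" values seen in the sample output.
CELL_TYPES = {"th": "header", "td": "value", "dt": "title", "dd": "description"}


def cell(el):
    """Describe one element by its role and its trimmed text."""
    return {"type": CELL_TYPES[el.tag], "text": "".join(el.itertext()).strip()}


def table_rows(table):
    """One row per <tr>, one entry per <th>/<td> cell."""
    return [[cell(c) for c in tr if c.tag in ("th", "td")]
            for tr in table.iter("tr")]


def description_rows(dl):
    """Start a new row at each <dt>; the <dd> items that follow join that row."""
    rows = []
    for child in dl:
        if child.tag not in ("dt", "dd"):
            continue
        if child.tag == "dt" or not rows:
            rows.append([])
        rows[-1].append(cell(child))
    return rows


table = ET.fromstring(
    "<table><caption>Table header</caption>"
    "<tr><th>Header1</th><th>Header2</th></tr>"
    "<tr><td>Text1</td><td>Text2</td></tr></table>"
)
dl = ET.fromstring(
    "<dl><dt>Title1</dt><dd>Description1</dd>"
    "<dt>Title2</dt><dd>Description2</dd></dl>"
)
print(table_rows(table)[1])
# [{'type': 'value', 'text': 'Text1'}, {'type': 'value', 'text': 'Text2'}]
print(description_rows(dl)[0])
# [{'type': 'title', 'text': 'Title1'}, {'type': 'description', 'text': 'Description1'}]
```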
©2024 Pureinsights Technology Corporation