HTML Processor

This processor parses the content of an HTML file. The 'process' action extracts a subset of its sections by CSS selectors, and the 'extract' action extracts its tables and description lists.

API

Jsoup, a Java library for working with real-world HTML (version 1.14.3).

Action: Process

Extracts HTML content from a field by using CSS selectors.

Configuration

Sample configuration in a processor:

{
  "name": "html processor",
  "type": "html-processor",
  "html": {
    "key": "file",
    "url": "url",
    "output": "outField",
    "select": [
      ".cl1"
    ]
  }
}

Configuration parameters:

key - Required, string

Name of the record field that contains the HTML file content.

url - Optional, string, default is null.

Name of the record field that contains the URL the HTML page was downloaded from. Jsoup uses it only to resolve relative links, so it is not required.
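
To illustrate what resolving relative links against the page URL means, here is a minimal Python sketch using the standard library's urljoin. It is not the processor's or Jsoup's code; the resolve_link helper and the example.com addresses are hypothetical and only show the general idea.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve a link found in a page against the URL the page came from."""
    return urljoin(page_url, href)


# Hypothetical page URL, standing in for the value of the "url" field.
page_url = "https://example.com/docs/page.html"

relative = resolve_link(page_url, "images/logo.png")
root_relative = resolve_link(page_url, "/about")
absolute = resolve_link(page_url, "https://other.example/x")
```

Without a base URL, a relative link such as images/logo.png cannot be turned into an absolute address, which is the only reason the url parameter exists.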

output - Required, string

Field name where output data will be placed.

select - Optional, string | list

List of CSS selectors to extract from the HTML document. Any element matching one of the selectors is extracted into the output. For example, the selector .cl1 matches <div class="cl1">. For more information on selectors, see the official Jsoup documentation.
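
As a rough illustration of how a class selector such as .cl1 matches elements, the following Python sketch checks the class attribute token by token. It is only an approximation under stated assumptions: select_by_class is a hypothetical helper, it handles class selectors only, and it relies on the standard library XML parser, which needs well-formed markup (Jsoup is far more lenient with real HTML).

```python
import xml.etree.ElementTree as ET


def select_by_class(markup: str, class_name: str) -> list:
    """Return the text of every element whose class attribute has the token."""
    root = ET.fromstring(markup.strip())
    return [
        "".join(element.itertext()).strip()
        for element in root.iter()
        # A class selector matches whole tokens, so "cl10" does not match "cl1".
        if class_name in element.get("class", "").split()
    ]


sample = """
<html><body>
  <div class="cl1 wide">First</div>
  <p class="cl10">Skipped</p>
  <span class="cl1">Second</span>
</body></html>
"""
matched = select_by_class(sample, "cl1")
```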

multipleOutputSelect - Optional, array

List of selector entries, each writing its matches to its own field in the output. Every entry takes the following parameters.

multipleOutputSelect.selector - Required, string

The CSS selector that identifies the elements to match. Any element matching the selector is extracted into the value of the corresponding output field. If nothing matches the selector, the field is not added to the output.

multipleOutputSelect.output - Required, string

Name of the output field for the returned values.

multipleOutputSelect.toArray - Optional, boolean, default is false.

If true, the matches for the selector are returned as a list of elements. If false, they are joined into a single string using multipleOutputSeparator.

multipleOutputSeparator - Optional, string

When multipleOutputSelect is used, defines the string placed between multiple matches of a selector. If not set, the matches are appended one after the other.

autoEncoding - Optional, boolean

If true, Jsoup tries to detect the encoding from the document's metadata. If false, the default UTF-8 encoding is used.
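
The interaction between toArray and multipleOutputSeparator can be sketched as follows. This is a minimal Python illustration of the behavior described above, not the processor's implementation; combine_matches is a hypothetical name, and treating "appended one after the other" as joining with an empty string is an assumption.

```python
from typing import List, Optional, Union


def combine_matches(
    matches: List[str],
    to_array: bool = False,
    separator: Optional[str] = None,
) -> Union[str, List[str], None]:
    """Combine the matches of one selector into a single output value."""
    if not matches:
        # No match: the field is not added to the output at all.
        return None
    if to_array:
        # toArray = true: keep every match as its own list element.
        return list(matches)
    # toArray = false: join the matches; an unset separator is assumed
    # to mean plain concatenation.
    return (separator or "").join(matches)


as_list = combine_matches(["First", "Second"], to_array=True)
joined = combine_matches(["First", "Second"], separator=" | ")
appended = combine_matches(["First", "Second"])
```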

Input/Output examples

Multiple Output Select Example
Configuration
{
  "name": "html processor",
  "type": "html-processor",
  "html": {
    "key": "file",
    "url": "url",
    "output": "outField",
    "multipleOutputSelect": [
      {
        "selector": "h1#title",
        "output": "title",
        "toArray": false
      },
      {
        "selector": "div#mainContent",
        "output": "body",
        "toArray": true
      }
    ],
    "onlyText": true
  }
}
Input

Using the sample configuration:

<html>
  <h1 id="title"> This is a title</h1>
  <div id="mainContent"> This is the content</div>
</html>
Output

The output is a map with the matching content, keyed by the configured output names:

{
  "outField": {
    "title": "This is a title",
    "body": ["This is the content"]
  }
}

Action: Extract

Extracts every table and description list found in the HTML content of a field.

Configuration

Sample configuration in a processor:

{
  "html": {
    "key": "file",
    "url": "url",
    "output": "outField",
    "titles": [
      "h0","h1","h2","h3","h4","h5","h6","title","caption"
    ]
  },
  "name": "html processor",
  "type": "html-processor"
}

Configuration parameters:

key - Required, string

Name of the record field that contains the HTML file content.

url - Optional, string, default is null.

Name of the record field that contains the URL the HTML page was downloaded from. Jsoup uses it only to resolve relative links, so it is not required.

output - Required, string

Field name where output data will be placed.

titles - Optional, string list

List of tags that are treated as headers for tables and description lists. A tag counts as the header if it is the first child of the table or description list element, or if it is that element's previous sibling. For example:

<table>
  <caption>Table header</caption> <!-- first child -->
  <tr>
    <th>Header1</th>
  </tr>
  <tr>
    <td>Text1</td>
  </tr>
</table>

<h1>Table header 1</h1> <!-- previous sibling -->
<table>
  <tr>
    <th>Header1</th>
  </tr>
  <tr>
    <td>Text1</td>
  </tr>
</table>
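
The header rule can be approximated in a few lines of Python. This is a sketch under stated assumptions, not the processor's code: find_headers and text_of are hypothetical helpers, the first child is checked before the previous sibling (the order of precedence is an assumption), and the standard library XML parser used here requires well-formed markup.

```python
import xml.etree.ElementTree as ET

TITLES = {"h1", "h2", "h3", "h4", "h5", "h6", "title", "caption"}


def text_of(element):
    """Collect the text of an element and its descendants."""
    return "".join(element.itertext()).strip()


def find_headers(markup, titles=TITLES):
    """Return (tag, header text or None) for every table and dl element."""
    root = ET.fromstring(markup.strip())
    found = []
    for parent in root.iter():
        siblings = list(parent)
        for index, element in enumerate(siblings):
            if element.tag not in ("table", "dl"):
                continue
            first_child = element[0] if len(element) else None
            previous = siblings[index - 1] if index > 0 else None
            if first_child is not None and first_child.tag in titles:
                found.append((element.tag, text_of(first_child)))
            elif previous is not None and previous.tag in titles:
                found.append((element.tag, text_of(previous)))
            else:
                found.append((element.tag, None))
    return found


sample = """
<body>
  <h1>Table header 1</h1>
  <table>
    <tr><th>Header1</th></tr>
    <tr><td>Text1</td></tr>
  </table>
  <table>
    <caption>Table header</caption>
    <tr><th>Header1</th></tr>
  </table>
</body>
"""
headers = find_headers(sample)
```

The first table has no title tag as its first child, so its header comes from the preceding h1; the second table takes its header from its caption.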

Input/Output examples

Extract Examples
Input

Using the sample configuration:

<html>
    <body>
        <h1>Description list header</h1>
        <dl>
          <dt>Title1</dt>
          <dd>Description1</dd>
        
          <dt>Title2</dt>
          <dd>Description2</dd>
        
        </dl>
        
        <table>
          <caption>Table header</caption>
          <tr>
            <th>Header1</th>
            <th>Header2</th>
          </tr>
          <tr>
            <td>Text1</td>
            <td>Text2</td>
          </tr>
        </table>
    </body>
</html>

This input would look as follows on the record:

{
  "file": "<html>\n<body>\n<h1>Description list header</h1>\n<dl>\n<dt>Title1</dt>......"
}

Output

The output is a list of every table and description list found in the HTML, with the elements grouped by rows. Every element has a type that describes its role in the HTML, and its corresponding text.

{
  "outField": [
    {
      "header": "Description list header",
      "description_list": [
        [{
            "type": "title",
            "text": "Title1"
          },
          {
            "type": "description",
            "text": "Description1"
          }],
        [{
            "type": "title",
            "text": "Title2"
          },
          {
            "type": "description",
            "text": "Description2"
          }]
      ]
    },
    {
      "header": "Table header",
      "table": [
        [{
            "type": "header",
            "text": "Header1"
          },
          {
            "type": "header",
            "text": "Header2"
          }],
        [{
            "type": "value",
            "text": "Text1"
          },
          {
            "type": "value",
            "text": "Text2"
          }]
      ]
    }
  ]
}
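
A consumer of this output will often want each table row as a plain mapping from header to value. The Python sketch below shows one way to do that; table_to_records is a hypothetical helper, and it assumes the first row of the table holds the header cells, as in the sample output above.

```python
def table_to_records(table):
    """Turn the row-grouped "table" value into one dict per value row."""
    header_row, *value_rows = table
    keys = [cell["text"] for cell in header_row]
    return [
        dict(zip(keys, (cell["text"] for cell in row)))
        for row in value_rows
    ]


# The "table" value taken from the sample output.
table = [
    [{"type": "header", "text": "Header1"}, {"type": "header", "text": "Header2"}],
    [{"type": "value", "text": "Text1"}, {"type": "value", "text": "Text2"}],
]
records = table_to_records(table)
```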

©2024 Pureinsights Technology Corporation