Elasticsearch walkthrough
Currently supported:
- Point In Time API, version 7.10 and above
- Mapping API, version 7.10 and above
- Indices/Aliases API, version 7.10 and above. Indices are the default option; if you want to use Aliases, contact the support team at helpme@rivery.io.
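For reference, the sketch below shows roughly how these APIs are called with the official Python client (8.x-style keyword arguments). It is an illustration only, not Rivery's implementation; the host, credentials, and index name are placeholders.

    from elasticsearch import Elasticsearch

    # Placeholder endpoint and credentials
    es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "changeme"))

    # Point In Time API (7.10+): freezes a consistent view of an index for paging
    pit = es.open_point_in_time(index="my-index", keep_alive="5m")

    # Mapping API (7.10+): the document keys shown in the Schema tab come from the index mapping
    mapping = es.indices.get_mapping(index="my-index")

    # Indices/Aliases API (7.10+): lists the aliases attached to an index
    aliases = es.indices.get_alias(index="my-index")

    es.close_point_in_time(id=pit["id"])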
Elasticsearch is a distributed, RESTful search and analytics engine that can handle a wide range of scenarios. The Elastic Stack stores your data centrally for lightning-fast search, fine-tuned relevancy, and scalable analytics.
Connection
To connect to Elasticsearch, refer to the Elastic connection topic.
When configuring a connection, you can select Elasticsearch if you are connecting to an Elasticsearch cluster, or OpenSearch if you are using an OpenSearch cluster.
After the connection is established, you can use the Elasticsearch source to integrate data into a cloud target.
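As a rough guide to which option maps to which cluster type, the minimal connectivity check below uses the official Python clients with placeholder endpoints and credentials. It is a sketch only; in Rivery the connection itself is configured in the UI rather than in code.

    from elasticsearch import Elasticsearch   # Elasticsearch clusters
    from opensearchpy import OpenSearch       # OpenSearch clusters

    # Placeholder endpoints and credentials
    es = Elasticsearch("https://es.example.com:9200", basic_auth=("elastic", "changeme"))
    print(es.info())         # verifies connectivity to an Elasticsearch cluster

    os_client = OpenSearch(hosts=["https://os.example.com:9200"], http_auth=("admin", "changeme"))
    print(os_client.info())  # verifies connectivity to an OpenSearch cluster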
Pulling Elasticsearch data into a target
The Elasticsearch source uses multi-table mode and Standard Extraction Mode by default, letting you load multiple Indices/Aliases into your target at the same time.
Procedure
- Navigate to the Data Integration Account.
- Select the connection you created.
- Set your target.
- Click the Schema tab and wait for all indices to load.
- Select the index, and the document's keys appear.
- These fields preserve the format of the original date and date nanos fields, which are incompatible with some programming languages, by adding a new date field in the yyyy-MM-dd HH:mm:ss.SSS (epoch time in milliseconds) format.
- You can enable these fields or leave them unselected. Selecting them does not affect your existing data; it only creates a new, reformatted date field for each date and date nanos field. Each new field is prefixed with 'es_format_' and saved in the table (see the date-conversion sketch after this procedure).
- Select Table Settings:
- The Filter field in the Table Settings tab can be filled in with the same query that you use in Elastic's Dev Tools to filter the search results.
- A single Object or an Array can be used as the filter (a query sketch follows this procedure). Here are some examples:
  Object - matches documents whose status field contains the value published:
  { "term": { "status": "published" }}
  Array - retrieves all documents whose status field contains published and whose publish_date is on or after the specified date:
  { "term": { "status": "published" }},
  { "range": { "publish_date": { "gte": "2015-01-01" }}}
  Ensure you copy the query into the Filter field without the square brackets.
- The default output format for Date Fields is yyyy-MM-dd HH:mm:ss.SSS.
- You can extract data in two ways:
  - All (Default)
  - Incremental
- Choose All to retrieve all data regardless of time period, or select Incremental to control the date range of your report (see the query sketch after this procedure). Only Date and Date Nanos fields can be used as the Incremental field.
  - Start Date is mandatory.
  - Data is retrieved for the date range specified between the start and end dates.
  - If you leave the end date blank, data is pulled up to the current time of the River's run.
  - Dates timezone: UTC.
  - Use the "Last Days Back For Each Run" option to gather data from a specified number of days before the selected start date.
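As a rough illustration of the optional es_format_ date fields described above, the sketch below converts a typical Elasticsearch date value into the yyyy-MM-dd HH:mm:ss.SSS format and into epoch milliseconds. It is a sketch only, assuming an ISO-8601 UTC value; it is not the connector's actual code.

    from datetime import datetime

    original = "2015-01-01T12:30:45.123Z"   # a typical Elasticsearch date value (UTC)
    dt = datetime.strptime(original, "%Y-%m-%dT%H:%M:%S.%f%z")

    # yyyy-MM-dd HH:mm:ss.SSS (millisecond precision)
    es_format_value = dt.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]

    # Epoch time in milliseconds (dates are handled in UTC)
    epoch_millis = int(dt.timestamp() * 1000)

    print(es_format_value, epoch_millis)    # 2015-01-01 12:30:45.123 1420115445123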
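The next sketch shows how the Filter field examples and an Incremental date range could be combined into a single Elasticsearch query, again using the 8.x-style Python client. The index name, field names, dates, endpoint, and credentials are placeholders, and the query shape is an assumption for illustration, not Rivery's internal implementation.

    from elasticsearch import Elasticsearch

    # Placeholder endpoint and credentials
    es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "changeme"))

    # The Array example from the Filter field (copied without the square brackets)
    filter_clauses = [
        {"term": {"status": "published"}},
        {"range": {"publish_date": {"gte": "2015-01-01"}}},
    ]

    # Incremental extraction adds a UTC range on the chosen Date/Date Nanos field;
    # a blank end date means "up to the current time of the River's run" (now).
    incremental_range = {"range": {"publish_date": {"gte": "2023-01-01T00:00:00.000Z", "lte": "now"}}}

    resp = es.search(
        index="my-index",   # placeholder index
        query={"bool": {"filter": filter_clauses + [incremental_range]}},
        size=1000,
    )
    print(resp["hits"]["total"])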
Limitations
Periods in index names are automatically replaced with underscores. If more than one index maps to the same name after this replacement, those indices are skipped and not included in the River.
Consider the following scenario: if the original indices are .Kibana_1 and _Kibana.1, both are automatically converted to _Kibana_1. The River cannot tell them apart, so both are excluded to avoid data discrepancies.
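The normalization rule can be sketched in a few lines (an illustration of the collision only, not the River's code):

    indices = [".Kibana_1", "_Kibana.1"]

    # Periods are replaced with underscores, so both names collapse to the same value
    normalized = [name.replace(".", "_") for name in indices]
    print(normalized)                               # ['_Kibana_1', '_Kibana_1']
    print(len(set(normalized)) < len(normalized))   # True -> ambiguous, so these indices are skipped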
The default fetching page size from Elastic is 10,000, but the page size a cluster can actually serve depends on the CPU/memory of the Elastic nodes. If the volume of data (documents) is substantial and the CPU/memory resources of the Elastic nodes are limited, a request for more than 100 documents per page may fail. This failure is reported in the logs. If you suspect this is the problem, contact the support team.
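For context, the usual Elasticsearch pattern for large result sets is a reduced page size combined with Point In Time and search_after paging, sketched below with the 8.x-style Python client. The index name, page size, endpoint, and credentials are placeholders; this only illustrates how page size interacts with paging and is not the connector's code.

    from elasticsearch import Elasticsearch

    # Placeholder endpoint and credentials
    es = Elasticsearch("https://localhost:9200", basic_auth=("elastic", "changeme"))

    pit = es.open_point_in_time(index="my-index", keep_alive="5m")
    page_size = 1000        # smaller than the 10,000 default to reduce node memory pressure
    search_after = None

    while True:
        extra = {"search_after": search_after} if search_after is not None else {}
        resp = es.search(
            size=page_size,
            pit={"id": pit["id"], "keep_alive": "5m"},
            sort=[{"_shard_doc": "asc"}],   # recommended tiebreaker sort when paging with PIT
            query={"match_all": {}},
            **extra,
        )
        hits = resp["hits"]["hits"]
        if not hits:
            break
        search_after = hits[-1]["sort"]     # resume from the last document of this page
        # ... process hits ...

    es.close_point_in_time(id=pit["id"])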