DataWorks Data Integration enables seamless data synchronization across multiple data sources, including MySQL, MaxCompute, Hologres, and Kafka. Data Integration provides batch synchronization, real-time data synchronization, and whole-database migration solutions for use cases like batch ETL, real-time data replication with second-level latency, and whole-database migration.
Synchronization solutions
Solution | Source | Destination | Latency | Use case |
Single-table sync (batch) | A single table | A single table or partition | Daily batch or periodic sync | Periodic full or incremental sync |
Sharded database and table sync (batch) | Multiple tables sharing identical schema | A single table or partition | Daily or custom intervals | Periodic full, periodic incremental |
Single-table sync (real-time) | A single table | A single table or partition | A few minutes or seconds | Real-time incremental (CDC) |
Sharded database and table sync (real-time) | Multiple logical tables (aggregated from physical tables) | One or multiple tables | A few minutes or seconds | Full + real-time incremental (CDC) |
Whole-database sync (batch) | An entire database or multiple tables | Multiple tables and their partitions | One-time or periodic | One-time/periodic full, one-time/periodic incremental, one-time full + periodic incremental |
Whole-database sync (real-time) | An entire database or multiple tables | Multiple tables and their partitions | A few minutes or seconds | Full + real-time incremental (CDC) |
Whole-database full and incremental sync (near real-time) | An entire database or multiple tables | Multiple tables and their partitions | | Full + real-time incremental (CDC) |
Recommended synchronization solutions
Choose your data synchronization approach based on these key factors:
Data freshness requirements: batch or real-time.
Data scale and complexity: The number of tables to sync and their processing logic.
Based on these factors, we recommend two main categories of synchronization solutions: batch and real-time.
1. Batch synchronization solutions (daily batch or periodic sync)
The solutions are suitable for use cases that do not require high data timeliness (for example, daily batch) and involve periodic batch processing.
Incremental synchronization requires a field to track data changes, such as a timestamp column (last_modified) or an auto-incrementing ID. Without such a field, run full sync periodically instead.
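The role of the tracking field can be illustrated with a minimal watermark-based extraction sketch. It is not how Data Integration is implemented internally; the `orders` table, the `incremental_rows` helper, and the use of an in-memory SQLite database are all assumptions made for illustration.

```python
import sqlite3


def incremental_rows(conn, table, tracking_column, last_watermark):
    """Return rows changed since last_watermark, plus the new watermark."""
    # Identifiers cannot be bound as SQL parameters, so pass only trusted
    # table and column names to this hypothetical helper.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE {tracking_column} > ? "
        f"ORDER BY {tracking_column}"
    )
    cursor = conn.execute(query, (last_watermark,))
    columns = [d[0] for d in cursor.description]
    rows = [dict(zip(columns, r)) for r in cursor.fetchall()]
    # Keep the old watermark when nothing changed, so the next run
    # resumes from the same position.
    new_watermark = rows[-1][tracking_column] if rows else last_watermark
    return rows, new_watermark


def demo():
    """Build a sample table and pull the rows modified after 2024-01-01."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, last_modified TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, 10.0, "2024-01-01 08:00:00"),
            (2, 25.5, "2024-01-02 09:30:00"),
            (3, 7.25, "2024-01-03 12:15:00"),
        ],
    )
    rows, watermark = incremental_rows(
        conn, "orders", "last_modified", "2024-01-01 23:59:59"
    )
    conn.close()
    return rows, watermark
```

Calling `demo()` returns the rows with `id` 2 and 3 and the new watermark `2024-01-03 12:15:00`. A query of this kind only sees rows that still exist, so hard deletes at the source are not picked up by this approach.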
a. Select "Single-table sync (batch)"
Ideal for custom processing logic on a limited number of diverse data sources.
Core advantage: flexible processing logic.
Advanced transformations: Enables complex field mapping, filtering, enrichment, and AI-powered processing.
Heterogeneous source integration: The best choice for processing non-standard data sources like APIs and log files.
Core limitation: expensive to scale.
Configuration overhead: Managing individual tasks becomes costly at scale.
High resource consumption: Each task is scheduled independently. The resource consumption of syncing 100 independent tables is far greater than that of one whole-database task.
See also: Single-table batch synchronization tasks
b. Select "Whole-database sync (batch)"
Efficiently migrate large volumes of homogeneous tables between systems.
Core advantages: High operational efficiency and low cost.
Efficient: Configure hundreds of tables at once with automatic object matching, greatly improving development efficiency.
Cost-effective: Resources are optimized through unified scheduling, resulting in extremely low costs (for example, one whole-database task may consume 2 CUs versus 100 CUs for equivalent single-table tasks).
Typical scenarios: Building the ODS layer of a data warehouse, periodic database backups, and data cloud migration.
Core limitation: Simple processing logic.
Primarily designed for replication and does not support complex transformation logic for individual tables.
Recommended solution: Offline whole-database synchronization tasks.
2. Real-time synchronization solutions (sub-minute latency)
Real-time solutions are suitable for applications that require capturing real-time data changes (inserts, deletes, updates) from the source to support real-time analytics and fast decision-making.
The source must either support Change Data Capture (CDC) or be a message queue. For example, MySQL requires binary logging (Binlog) to be enabled, while Kafka is natively a message queue.
Select "single-table real-time" or "whole-database real-time"
Single-table real-time: Suitable for cases requiring complex processing of real-time change streams from a single table.
Whole-database real-time: The standard solution for building real-time data warehouses and data lakes and for implementing real-time database disaster recovery. It offers significant advantages in efficiency and cost-effectiveness.
Recommended solutions: Real-time single-table synchronization task; Data Integration-side synchronization task
3. Special case: syncing real-time data to append-only tables
Real-time synchronization captures CDC events including inserts, updates, and deletes. For append-only storage systems like MaxCompute non-Delta tables, which do not natively support physical Update and Delete operations, writing a raw CDC stream directly results in data inconsistencies (for example, delete operations are ignored).
DataWorks solution: Base + Log tables
This solution resolves the issue by creating a Base table (full snapshot) and a Log table (incremental changes) at the destination.
Write method: CDC data streams to the Log table in real time. Daily, the system automatically schedules a task to merge the changes from the Log table into the Base table, generating an up-to-date full snapshot. This approach ensures that changes are written to the incremental table within minutes and merged into the Base table daily.
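The daily merge described above can be sketched as replaying the logged changes over the previous snapshot. This is only an illustration of the idea, not the merge task that DataWorks generates; the `merge_log_into_base` function, the event shape (`op` and `row` keys), and the `id` primary key are assumptions.

```python
def merge_log_into_base(base, log, key="id"):
    """Apply CDC events from the log, in order, to a copy of the base snapshot."""
    # Index the previous full snapshot by primary key.
    snapshot = {row[key]: dict(row) for row in base}
    for event in log:
        op, row = event["op"], event["row"]
        if op in ("insert", "update"):
            # Inserts and updates both leave the latest row image in place.
            snapshot[row[key]] = dict(row)
        elif op == "delete":
            # Deletes remove the row; a missing key is tolerated.
            snapshot.pop(row[key], None)
        else:
            raise ValueError(f"unknown operation: {op}")
    # Return the new full snapshot ordered by key.
    return sorted(snapshot.values(), key=lambda r: r[key])


base_table = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
log_table = [
    {"op": "update", "row": {"id": 1, "name": "a2"}},
    {"op": "delete", "row": {"id": 2}},
    {"op": "insert", "row": {"id": 3, "name": "c"}},
]
new_base = merge_log_into_base(base_table, log_table)
```

Here `new_base` contains the updated row 1 and the new row 3, while row 2 is gone. Because the Log table is only ever appended to, an append-only destination can store it without needing physical updates or deletes.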
Recommended solution: Synchronize full and incremental data in a database to MaxCompute in quasi real time.
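The selection guidance in this section can be condensed into a small decision helper. It is a simplified sketch of the recommendations above, not an official selection tool; the `recommend_solution` function and its three boolean inputs are assumptions, and real projects may need to weigh more factors.

```python
def recommend_solution(needs_real_time, many_homogeneous_tables,
                       append_only_destination=False):
    """Map the key selection factors to a solution name from this page."""
    if needs_real_time:
        if append_only_destination:
            # Append-only destinations cannot apply updates and deletes
            # directly, so use the Base + Log approach.
            return "Whole-database full and incremental sync (near real-time)"
        if many_homogeneous_tables:
            return "Whole-database sync (real-time)"
        return "Single-table sync (real-time)"
    if many_homogeneous_tables:
        # Unified scheduling keeps configuration effort and resource use low.
        return "Whole-database sync (batch)"
    # Few, diverse sources with custom processing logic.
    return "Single-table sync (batch)"
```

For example, `recommend_solution(False, True)` returns "Whole-database sync (batch)", which matches the ODS-layer scenario described earlier.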
Data source read/write capabilities
Data source | Single-table sync (batch) | Single-table sync (real-time) | Whole-database sync (batch) | Whole-database sync (real-time) | Whole-database full and incremental (near real-time) |
Read/Write | - | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | Write | Read | Write | - | |
Read/Write | - | Read | - | - | |
Read/Write | - | - | Read | Read | |
Read | - | - | - | - | |
Read | - | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | Read/Write | - | Write | - | |
Write | Write | - | Write | - | |
Read/Write | - | Read | - | - | |
Read/Write | Write | - | - | - | |
Read/Write | - | Read | - | - | |
Read/Write | - | Read | - | - | |
Elasticsearch | Read/Write | Write | Write | Write | - |
Read/Write | - | - | - | - | |
GBase8a | Read/Write | - | - | - | - |
HBase | | - | - | - | - |
Read/Write | - | - | - | - | |
Hive | Read/Write | - | Write | - | - |
Read/Write | Read/Write | Read/Write | Write | - | |
Read | - | - | - | - | |
Read/Write | Read/Write | - | Write | - | |
Read/Write | - | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | Read | - | - | - | |
Read/Write | Write | Write | - | Write | |
Read/Write | - | - | - | - | |
Write | - | - | - | - | |
Write | - | - | - | - | |
Read | - | - | - | - | |
Write | - | - | - | - | |
Read/Write | - | - | Read | - | |
Read/Write | Read | Read | Read | Read | |
Write | - | - | - | - | |
Read/Write | Read | Read | Read | Read | |
Read/Write | Write | Write | - | - | |
Read/Write | Write | - | - | - | |
Read/Write | Read | Read | Read | Read | |
Read/Write | - | Read | Read | - | |
Read/Write | - | Read | Read | - | |
Write | - | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | - | - | - | - | |
Write | - | - | - | - | |
Read/Write | Write | Write | - | - | |
Read/Write | - | Read | - | - | |
Read/Write | Write | - | - | - | |
Read/Write | - | - | - | - | |
Read/Write | - | - | - | - | |
Write | - | - | - | - | |
Vertica | Read/Write | - | - | - | - |
Read | - | - | - | - |
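A capability table like this can also be kept as a lookup structure so that a task configuration can be checked before it is created. The sketch below covers only the rows above that have a data source name and complete cells (Elasticsearch, GBase8a, Hive, Vertica); the `CAPABILITIES` mapping and the `supports` function are hypothetical names, not part of any DataWorks API.

```python
# Column order follows the table header.
SOLUTIONS = (
    "Single-table sync (batch)",
    "Single-table sync (real-time)",
    "Whole-database sync (batch)",
    "Whole-database sync (real-time)",
    "Whole-database full and incremental (near real-time)",
)

# Cell values copied from the named rows of the table; "-" means unsupported.
CAPABILITIES = {
    "Elasticsearch": ("Read/Write", "Write", "Write", "Write", "-"),
    "GBase8a": ("Read/Write", "-", "-", "-", "-"),
    "Hive": ("Read/Write", "-", "Write", "-", "-"),
    "Vertica": ("Read/Write", "-", "-", "-", "-"),
}


def supports(source, solution, direction):
    """Return True if `source` supports `direction` ("Read" or "Write")."""
    cell = CAPABILITIES[source][SOLUTIONS.index(solution)]
    # "Read/Write" splits into both directions; "-" matches neither.
    return direction in cell.split("/")
```

For instance, `supports("Hive", "Whole-database sync (batch)", "Write")` is true, while `supports("Hive", "Single-table sync (real-time)", "Read")` is false.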
References
The following Data Integration documents help you get started quickly.
For data source configuration, see Data Source Management.
For synchronization task configuration, see:
Offline whole-database synchronization tasks
Synchronize full and incremental data in a database to MaxCompute in quasi real time
For common data synchronization issues, see FAQ.