If you want to use DataWorks to develop and manage CDH (Cloudera's Distribution Including Apache Hadoop, hereinafter referred to as CDH) tasks, you need to first bind your CDH cluster as a CDH computing resource in DataWorks. After binding, you can use this computing resource in DataWorks for data synchronization, development, and other operations.
Prerequisites
A DataWorks workspace has been created, and the operator's RAM user has been added to the workspace and assigned the Workspace Administrator role.
A CDH cluster has been deployed.
Note: DataWorks supports CDH deployed in environments other than Alibaba Cloud ECS, but you must ensure that the environment where CDH is deployed can connect to an Alibaba Cloud virtual private cloud (VPC). You can typically use the network connectivity methods for IDC data sources to ensure connectivity.
A resource group has been bound to the workspace, and network connectivity is ensured.
When using a Serverless resource group, you only need to ensure that the CDH computing resource has normal connectivity with the Serverless resource group.
When using the old-version exclusive resource group, you need to ensure that the CDH computing resource has normal connectivity with the exclusive resource group for scheduling in the corresponding scenario. (A basic reachability check is sketched after this list.)
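The following is a minimal sketch, not part of DataWorks, that you can run from a machine in the same network environment as the resource group to check basic TCP reachability of the cluster endpoints before binding. All hostnames and ports below are hypothetical placeholders; replace them with your cluster's actual values.
```
import socket

# Hypothetical CDH endpoints; substitute your cluster's hosts and ports.
ENDPOINTS = [
    ("cdh-master.example.com", 10000),  # HiveServer2 (default port)
    ("cdh-master.example.com", 9083),   # Hive metastore (default port)
    ("cdh-master.example.com", 8088),   # YARN ResourceManager web UI
]

for host, port in ENDPOINTS:
    try:
        # Attempt a plain TCP connection with a 5-second timeout.
        with socket.create_connection((host, port), timeout=5):
            print(f"OK    {host}:{port}")
    except OSError as exc:
        print(f"FAIL  {host}:{port} ({exc})")
```
A successful TCP connection does not guarantee that the service is healthy, but a failure is a reliable sign that network connectivity still needs to be configured.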
Limits
Permission limitations:
Operator
Permission description
Alibaba Cloud account
No additional authorization required.
Alibaba Cloud RAM user/RAM role
Only workspace members with the O&M or Workspace Administrator role, or workspace members with the AliyunDataWorksFullAccess permission, can create computing resources. For authorization details, see Grant user workspace administrator permissions.
Access the computing resource list page
Log on to the DataWorks console, switch to the target region, select the target workspace from the drop-down list in the left-side navigation pane, and click Enter Management Center.
In the left-side navigation pane, click Computing Resources to access the computing resource list page.
Bind CDH computing resources
On the computing resource list page, configure the binding of CDH computing resources.
Select the type of computing resource to bind.
Click Bind Computing Resource or Create Computing Resource to access the Bind Computing Resource page.
On the Bind Computing Resource page, select CDH as the computing resource type to access the Bind CDH Computing Resource configuration page.
Configure the CDH computing resource.
On the Bind CDH Computing Resource configuration page, configure according to the following table.
Parameter
Configuration description
Cluster Version
Select the registered cluster version.
DataWorks provides CDH5.16.2, CDH6.1.1, CDH6.2.1, CDH6.3.2, and CDP7.1.7 versions that you can select directly. The component versions (i.e., the versions of each component in the cluster connection information) for these cluster versions are fixed. If these cluster versions do not meet your business needs, you can select Custom Version and configure component versions as needed.
Note: Different cluster versions require different component configurations. Refer to the actual console interface for details.
If you register a Custom Version cluster with DataWorks, only the old-version exclusive resource group for scheduling is supported. After registration, submit a ticket to contact technical support to initialize the relevant environment.
Cluster Name
Select a cluster name that is already registered in another workspace to load its configurations, or enter a custom cluster name and fill in new configurations.
Cluster Connection Information
Hive Connection Information
Used to submit Hive jobs to the cluster.
HiveServer2 configuration format:
jdbc:hive2://<host>:<port>/<database>
Metastore configuration format:
thrift://<host>:<port>
Parameter acquisition: For information about how to obtain these values, see Obtain CDH or CDP cluster information and configure network connectivity.
Component version selection: The system automatically identifies the corresponding component version for the current cluster.
Note: If you use a Serverless resource group to access CDH components through domain names, you need to configure authoritative DNS resolution for the CDH component domain names in the Internal DNS Resolution (PrivateZone) module of Alibaba Cloud DNS and set the effective scope of the domain names.
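For example, with a hypothetical HiveServer2 host cdh-master.example.com, the default HiveServer2 port 10000, and the default metastore port 9083, the two values might look as follows:
```
jdbc:hive2://cdh-master.example.com:10000/default
thrift://cdh-master.example.com:9083
```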
Impala Connection Information
Used to submit Impala jobs.
Configuration format:
jdbc:impala://<host>:<port>/<schema>
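A hypothetical example, assuming Impala's default JDBC port 21050 and a schema named default:
```
jdbc:impala://cdh-master.example.com:21050/default
```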
Spark Connection Information
If you need to use the Spark component in DataWorks, you can select the default version here and configure it.
Yarn Connection Information
Configurations for submitting tasks and viewing task details.
Yarn.Resourcemanager.Address configuration format:
http://<host>:<port>
Note: The address to which Spark or MapReduce tasks are submitted.
Jobhistory.Webapp.Address configuration format:
http://<host>:<port2>
Note: Specifies the web UI address of the JobHistory server. You can access this address in a browser to view detailed information about historical tasks.
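A hypothetical example, assuming the Hadoop default web ports (8088 for the ResourceManager and 19888 for the JobHistory server web UI); replace the host and ports with your cluster's actual values:
```
http://cdh-master.example.com:8088
http://cdh-master.example.com:19888
```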
MapReduce Connection Information
If you need to use the MapReduce component in DataWorks, you can select the default version here and configure it.
Presto Connection Information
Used to submit Presto jobs.
JDBC address information configuration format:
jdbc:presto://<host>:<port>/<catalog>/<schema>
Note: Presto is not a default CDH component. Configure this based on your actual deployment.
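A hypothetical example, assuming the common Presto coordinator port 8080, a catalog named hive, and a schema named default:
```
jdbc:presto://cdh-master.example.com:8080/hive/default
```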
Cluster Configuration Files
Configure Core-Site File
Contains global configurations for the Hadoop Core library. For example, commonly used I/O settings for HDFS and MapReduce.
To run Spark or MapReduce tasks, you need to upload this file.
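For illustration, a minimal core-site.xml typically carries properties such as fs.defaultFS (the host below is hypothetical):
```
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://cdh-master.example.com:8020</value>
  </property>
</configuration>
```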
Configure Hdfs-Site File
Contains HDFS-related configurations. For example, block size, replication count, path names, etc.
Configure Mapred-Site File
Used to configure MapReduce-related parameters. For example, configuring the execution method and scheduling behavior of MapReduce jobs.
To run MapReduce tasks, you need to upload this file.
Configure Yarn-Site File
Contains all configurations related to YARN daemon processes. For example, environment configurations for resource managers, node managers, and application runtime.
You must upload this file to run Spark or MapReduce jobs, or when you select Kerberos as the account mapping type.
Configure Hive-Site file
Contains parameters for configuring Hive. For example, database connection information, Hive Metastore settings, and execution engines.
You need to upload this file when the account mapping type is set to Kerberos.
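For illustration, hive-site.xml typically includes the metastore address, for example (hypothetical host, default metastore port):
```
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://cdh-master.example.com:9083</value>
  </property>
</configuration>
```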
Configure Spark-Defaults File
Used to specify the default configurations applied when Spark jobs are executed. You can preset a series of parameters (such as memory size and CPU cores) in the spark-defaults.conf file, and Spark applications adopt them at runtime.
To run Spark tasks, you need to upload this file.
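For illustration, a spark-defaults.conf file might preset values such as the following (the values are examples only; tune them to your cluster):
```
spark.executor.memory   4g
spark.executor.cores    2
spark.driver.memory     2g
```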
Configure Config.Properties File
Contains configurations related to the Presto server. For example, setting global properties for coordinator nodes and worker nodes in the Presto cluster.
If you use the Presto component and the account mapping type is OPEN LDAP or Kerberos, you need to upload this file.
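For illustration, a Presto config.properties file typically includes coordinator and discovery settings such as the following (host and port are hypothetical):
```
coordinator=true
http-server.http.port=8080
discovery.uri=http://cdh-master.example.com:8080
```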
Configure Presto.Jks File
Used to store security certificates, including private keys and public key certificates issued to applications. In the Presto query engine, the presto.jks file is used to enable SSL/TLS-encrypted communication for Presto processes, ensuring the security of data transmission.
Default Access Identity
If you select a mapped cluster account identity, go to the Computing Resources list page, click the Account Mapping tab, and then click Configure cluster identity mapping.
Development environment: You can select cluster account or cluster account mapped to the task executor.
Production environment: You can select cluster account, cluster account mapped to the task owner, cluster account mapped to the Alibaba Cloud account, or cluster account mapped to the Alibaba Cloud RAM user.
Computing Resource Instance Name
Customize the computing resource instance name. When running tasks, you can select the computing resource for task execution based on the computing resource name.
Click Confirm to complete the CDH computing resource configuration.
Resource group initialization
When you register a cluster for the first time, or after cluster service configurations change (for example, core-site.xml is modified), initialize the resource group to ensure that it can access the CDH cluster based on the configured network connectivity.
On the Computing Resources list page, find the CDH computing resource you created. Click Resource Group Initialization in the upper right corner.
Click Initialize next to the required resource group. After the resource group initialization is successful, click OK.
(Optional) Set YARN resource queue
You can find the CDH cluster you bound on the Computing Resources list page, click the YARN Resource Queue tab, and click Edit YARN Resource Queue to set dedicated YARN resource queues for tasks in different modules.
(Optional) Set SPARK parameters
Set dedicated SPARK property parameters for tasks in different modules.
Find the CDH cluster you bound on the Computing Resources list page.
Click the Edit SPARK Parameters button on the SPARK Parameters tab to enter the edit page for CDH cluster SPARK parameters.
Click Add below a module, and then enter a Spark Property Name and the corresponding Spark Property Value to set the Spark property information; an illustrative pair is shown below.
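For illustration only, a property pair might direct a module's tasks to a dedicated queue (the queue name dev_queue is hypothetical):
```
spark.yarn.queue = dev_queue
```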
What to do next
After configuring the CDH computing resource, you can perform data development operations through CDH-related nodes in data development.