If you want to use DataWorks to develop and manage CDH (Cloudera's Distribution Including Apache Hadoop, hereinafter referred to as CDH) tasks, you need to first bind your CDH cluster as a CDH computing resource in DataWorks. After binding, you can use this computing resource in DataWorks for data synchronization, development, and other operations.
Prerequisites
A DataWorks workspace has been created, and the operator's RAM user has been added to the workspace and assigned the Workspace Administrator role.
A CDH cluster has been deployed.
Note: DataWorks supports CDH deployed in environments other than Alibaba Cloud ECS, but you must ensure that the environment where CDH is deployed can connect to an Alibaba Cloud virtual private cloud (VPC). You can typically use the network connectivity methods for IDC data sources to ensure connectivity.
A resource group has been bound to the workspace, and network connectivity is ensured.
When using a Serverless resource group, you only need to ensure that the CDH computing resource has normal connectivity with the Serverless resource group.
When using the old-version exclusive resource group, you need to ensure that the CDH computing resource has normal connectivity with the exclusive resource group for scheduling in the corresponding scenario. (A basic reachability check is sketched after this list.)
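The following is a minimal sketch, not part of DataWorks, that you can run from a machine in the same network environment as the resource group to check basic TCP reachability of the cluster endpoints before binding. All hostnames and ports below are hypothetical placeholders; replace them with your cluster's actual values.
```
import socket

# Hypothetical CDH endpoints; substitute your cluster's hosts and ports.
ENDPOINTS = [
    ("cdh-master.example.com", 10000),  # HiveServer2 (default port)
    ("cdh-master.example.com", 9083),   # Hive metastore (default port)
    ("cdh-master.example.com", 8088),   # YARN ResourceManager web UI
]

for host, port in ENDPOINTS:
    try:
        # Attempt a plain TCP connection with a 5-second timeout.
        with socket.create_connection((host, port), timeout=5):
            print(f"OK    {host}:{port}")
    except OSError as exc:
        print(f"FAIL  {host}:{port} ({exc})")
```
A successful TCP connection does not guarantee that the service is healthy, but a failure is a reliable sign that network connectivity still needs to be configured.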
Limits
Permission limitations:
Operator
Permission description
Alibaba Cloud account
No additional authorization required.
Alibaba Cloud RAM user/RAM role
Only workspace members with the O&M or Workspace Administrator role, or workspace members with the AliyunDataWorksFullAccess permission, can create computing resources. For authorization details, see Grant user workspace administrator permissions.
Access the computing resource list page
Log on to the DataWorks console, switch to the target region, select the target workspace from the drop-down list in the left-side navigation pane, and click Enter Management Center.
In the left-side navigation pane, click Computing Resources to access the computing resource list page.
Bind CDH computing resources
On the computing resource list page, configure the binding of CDH computing resources.
Select the type of computing resource to bind.
Click Bind Computing Resource or Create Computing Resource to access the Bind Computing Resource page.
On the Bind Computing Resource page, select CDH as the computing resource type to access the Bind CDH Computing Resource configuration page.
Configure the CDH computing resource.
On the Bind CDH Computing Resource configuration page, configure according to the following table.
Parameter
Configuration description
Cluster Version
Select the registered cluster version.
DataWorks provides CDH5.16.2, CDH6.1.1, CDH6.2.1, CDH6.3.2, and CDP7.1.7 versions that you can select directly. The component versions (i.e., the versions of each component in the cluster connection information) for these cluster versions are fixed. If these cluster versions do not meet your business needs, you can select Custom Version and configure component versions as needed.
Note: Different cluster versions require different component configurations. Refer to the actual console interface for details.
If you register a Custom Version cluster with DataWorks, only the old-version exclusive resource group for scheduling is supported. After registration, submit a ticket to contact technical support to initialize the relevant environment.
Cluster Name
Select a cluster name that is already registered in another workspace to load its configurations, or enter a custom cluster name and fill in new configurations.
Cluster Connection Information
Hive Connection Information
Used to submit Hive jobs to the cluster.
HiveServer2 configuration format:
jdbc:hive2://<host>:<port>/<database>
Metastore configuration format:
thrift://<host>:<port>
Parameter acquisition: For information about how to obtain these values, see Obtain CDH or CDP cluster information and configure network connectivity.
Component version selection: The system automatically identifies the corresponding component version for the current cluster.
Note: If you use a Serverless resource group to access CDH components through domain names, you need to configure authoritative DNS resolution for the CDH component domain names in the Internal DNS Resolution (PrivateZone) module of Alibaba Cloud DNS and set the effective scope of the domain names.
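For example, with a hypothetical HiveServer2 host cdh-master.example.com, the default HiveServer2 port 10000, and the default metastore port 9083, the two values might look as follows:
```
jdbc:hive2://cdh-master.example.com:10000/default
thrift://cdh-master.example.com:9083
```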
Impala Connection Information
Used to submit Impala jobs.
Configuration format:
jdbc:impala://<host>:<port>/<schema>
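A hypothetical example, assuming Impala's default JDBC port 21050 and a schema named default:
```
jdbc:impala://cdh-master.example.com:21050/default
```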
Spark Connection Information
If you need to use the Spark component in DataWorks, you can select the default version here and configure it.
Yarn Connection Information
Configurations for submitting tasks and viewing task details.
Yarn.Resourcemanager.Address configuration format:
http://<host>:<port>
Note: The address to which Spark or MapReduce tasks are submitted.
Jobhistory.Webapp.Address configuration format:
http://<host>:<port2>
Note: Specifies the web UI address of the JobHistory server. You can access this address in a browser to view detailed information about historical tasks.
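A hypothetical example, assuming the Hadoop default web ports (8088 for the ResourceManager and 19888 for the JobHistory server web UI); replace the host and ports with your cluster's actual values:
```
http://cdh-master.example.com:8088
http://cdh-master.example.com:19888
```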
MapReduce Connection Information
If you need to use the MapReduce component in DataWorks, you can select the default version here and configure it.
Presto Connection Information
Used to submit Presto jobs.
JDBC address information configuration format:
jdbc:presto://<host>:<port>/<catalog>/<schema>
Note: Presto is not a default CDH component. Configure this based on your actual deployment.
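A hypothetical example, assuming the common Presto coordinator port 8080, a catalog named hive, and a schema named default:
```
jdbc:presto://cdh-master.example.com:8080/hive/default
```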
Cluster Configuration Files
Configure Core-Site File
Contains global configurations for the Hadoop Core library. For example, commonly used I/O settings for HDFS and MapReduce.
To run Spark or MapReduce tasks, you need to upload this file.
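For illustration, a minimal core-site.xml typically carries properties such as fs.defaultFS (the host below is hypothetical):
```
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://cdh-master.example.com:8020</value>
  </property>
</configuration>
```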
Configure Hdfs-Site File
Contains HDFS-related configurations. For example, block size, replication count, path names, etc.
Configure Mapred-Site File
Used to configure MapReduce-related parameters. For example, configuring the execution method and scheduling behavior of MapReduce jobs.
To run MapReduce tasks, you need to upload this file.
Configure Yarn-Site File
Contains all configurations related to YARN daemon processes. For example, environment configurations for resource managers, node managers, and application runtime.
You must upload this file to run Spark or MapReduce jobs, or when you select Kerberos as the account mapping type.
Configure Hive-Site file
Contains parameters for configuring Hive. For example, database connection information, Hive Metastore settings, and execution engines.
You need to upload this file when the account mapping type is set to Kerberos.
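For illustration, hive-site.xml typically includes the metastore address, for example (hypothetical host, default metastore port):
```
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://cdh-master.example.com:9083</value>
  </property>
</configuration>
```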
Configure Spark-Defaults File
Used to specify the default configurations applied when Spark jobs are executed. You can preset a series of parameters (such as memory size and CPU cores) in the spark-defaults.conf file, and Spark applications adopt them at runtime.
To run Spark tasks, you need to upload this file.
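For illustration, a spark-defaults.conf file might preset values such as the following (the values are examples only; tune them to your cluster):
```
spark.executor.memory   4g
spark.executor.cores    2
spark.driver.memory     2g
```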
Configure Config.Properties File
Contains configurations related to the Presto server. For example, setting global properties for coordinator nodes and worker nodes in the Presto cluster.
If you use the Presto component and the account mapping type is OPEN LDAP or Kerberos, you need to upload this file.
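For illustration, a Presto config.properties file typically includes coordinator and discovery settings such as the following (host and port are hypothetical):
```
coordinator=true
http-server.http.port=8080
discovery.uri=http://cdh-master.example.com:8080
```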
Configure Presto.Jks File
Used to store security certificates, including private keys and public key certificates issued to applications. In the Presto query engine, the presto.jks file is used to enable SSL/TLS-encrypted communication for Presto processes, ensuring the security of data transmission.
Default Access Identity
If you select a mapped cluster account identity, go to the Computing Resources list page, click the Account Mapping tab, and then click Configure cluster identity mapping.
Development environment: You can select cluster account or cluster account mapped to the task executor.
Production environment: You can select cluster account, cluster account mapped to the task owner, cluster account mapped to the Alibaba Cloud account, or cluster account mapped to the Alibaba Cloud RAM user.
Computing Resource Instance Name
Customize the computing resource instance name. When running tasks, you can select the computing resource for task execution based on the computing resource name.
Click Confirm to complete the CDH computing resource configuration.
Resource group initialization
When you register a cluster for the first time, or after cluster service configurations change (for example, core-site.xml is modified), initialize the resource group to ensure that it can access the CDH cluster based on the configured network connectivity.
On the Computing Resources list page, find the CDH computing resource you created. Click Resource Group Initialization in the upper right corner.
Click Initialize next to the required resource group. After the resource group initialization is successful, click OK.
(Optional) Set YARN resource queue
You can find the CDH cluster you bound on the Computing Resources list page, click the YARN Resource Queue tab, and click Edit YARN Resource Queue to set dedicated YARN resource queues for tasks in different modules.
(Optional) Set SPARK parameters
Set dedicated SPARK property parameters for tasks in different modules.
Find the CDH cluster you bound on the Computing Resources list page.
Click the Edit SPARK Parameters button on the SPARK Parameters tab to enter the edit page for CDH cluster SPARK parameters.
Click Add below a module, and then enter a Spark Property Name and the corresponding Spark Property Value to set the Spark property information; an illustrative pair is shown below.
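For illustration only, a property pair might direct a module's tasks to a dedicated queue (the queue name dev_queue is hypothetical):
```
spark.yarn.queue = dev_queue
```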
What to do next
After configuring the CDH computing resource, you can perform data development operations through CDH-related nodes in data development.