Develop an ODPS SQL task - DataWorks - Alibaba Cloud Documentation Center

DataWorks provides the ODPS SQL node, which lets you periodically schedule MaxCompute SQL tasks. You can also integrate these tasks with other node types for combined scheduling. MaxCompute SQL tasks use an SQL-like syntax and are ideal for the distributed processing of large datasets (in terabytes) that do not require real-time results. This topic describes important considerations and provides guidance for developing SQL tasks in DataWorks.

About the node

ODPS SQL is used to process and query data in MaxCompute. It supports common SQL operations, such as SELECT, INSERT, UPDATE, and DELETE, along with MaxCompute-specific syntax and functions. ODPS SQL lets you write SQL-like statements to query and process data without having to write complex data processing logic. For more information about the SQL syntax, see SQL overview.

Limits

The following table describes the limits on developing ODPS SQL nodes in DataWorks.

Classification	Description
Comment usage	Supports single-line comments that start with `--`. Multi-line comments enclosed in `/* */` are not supported. For more information, see MaxCompute SQL comments. The following limits also apply to comments. Comments cannot be added after a statement that contains only a keyword, such as SET or USE. Semicolons (;) are not supported in comments. Comments cannot be added to the end of a completed statement. A statement is considered complete if it ends with a semicolon (;).


SQL submission	ODPS SQL does not support running SET or USE statements alone. They must be executed with another SQL statement.
SQL development	The SQL code size cannot exceed 128 KB. The number of SQL commands cannot exceed 200.
Query results	Only SQL statements that start with SELECT or WITH can output a formatted result set. Query results have the following limits: If a query result contains more than 10,000 rows, a maximum of 10,000 rows are displayed. If a query result exceeds 10 MB in size, the error `Result is too large, exceed the limit size: 10MB` is reported. Note If you encounter these limits, download the query results to a local device for viewing using one of the following methods: Method 1: Run the Tunnel Download command using the MaxCompute client (odpscmd) to download the results. Method 2: Use the export feature of the SQL Query module in DataWorks to export the query results to a local device.

Notes

Ensure that the account used to run the ODPS SQL task has the required permissions for the corresponding MaxCompute project. For more information, see DataWorks on MaxCompute permission control and MaxCompute permissions.
MaxCompute SQL tasks require quota resources to run. If your task is taking a long time to run, go to the MaxCompute console to check the quota resource consumption. Ensure that sufficient resources are available for the task to run. For more information, see Computing resources - Quota management.
When you develop an ODPS SQL node task, special parameters, such as an OSS address, must be enclosed in double quotation marks (""). If you do not enclose them in quotation marks, the task may fail to parse, which causes the task to fail.
In some extreme cases, such as an unexpected server power outage or a primary/secondary failover, DataWorks may not be able to completely stop the related MaxCompute task processes. If this occurs, go to the corresponding project in MaxCompute Computing Resources to stop the job.

The execution order of statements that contain keywords, such as SET and USE, varies depending on the environment in DataWorks.

In DataStudio: All statements in the current task code that contain keywords such as SET and USE are combined. They are then executed before all other SQL statements.
In the scheduling environment: The statements are executed in the order they are written.

Assume that the code in the node is defined as follows.

SET a=b;
CREATE TABLE name1(id string);
SET c=d;
CREATE TABLE name2(id string);

The execution order in different environments is as follows:

Executed SQL	DataStudio	Scheduling O&M
First SQL statement	`SET a=b; SET c=d; CREATE TABLE name1(id string);`	`SET a=b; CREATE TABLE name1(id string);`
Second SQL statement	`SET a=b; SET c=d; CREATE TABLE name2(id string);`	`SET c=d; CREATE TABLE name2(id string);`

Edit code: Simple examples

SQL commands

MaxCompute SQL commands use a syntax similar to standard SQL. They support DDL, DML, and DQL statements, along with commands with MaxCompute-specific syntax. For detailed syntax requirements and examples of each SQL command, see SQL overview. The following simple examples show how to develop and run SQL commands.

Note

When a MaxCompute V2.0 extension function uses a new data type, you must add the SET odps.sql.type.system.odps2=true; command before the SQL statement of the function. Then, you must submit this command together with the SQL statement to use the new data type. For more information about V2.0 data types, see Data types (V2.0).
DataWorks provides scheduling parameters to allow for dynamic parameter passing in scheduling scenarios. In an ODPS SQL node, you can define variables in the code using the ${variable_name} format. Then, on the Schedule tab, you can assign a value to the variable in the Parameters section. For more information about the supported formats for scheduling parameters, see Supported formats of scheduling parameters.

Create a table
You can use the CREATE TABLE statement to create non-partitioned tables, partitioned tables, foreign tables, and clustered tables. For more information, see CREATE TABLE. The following code provides a sample SQL statement:
```
-- Create a partitioned table named students.
CREATE TABLE if NOT EXISTS students
( id BIGINT,
  name STRING,
  age BIGINT,
  birth DATE)
partitioned BY (gender STRING); 
```

Insert data

You can use the INSERT INTO or INSERT OVERWRITE statement to insert or update data in a destination table. For more information, see Insert or overwrite data (INSERT INTO | INSERT OVERWRITE). The following code provides a sample SQL statement:

-- Insert data.
INSERT INTO students PARTITION(gender='boy') VALUES (1,'Zhang San',15,DATE '2008-05-15') ;
INSERT INTO students PARTITION(gender='boy') VALUES (2,'Li Si',17,DATE '2006-07-20') ;
INSERT INTO students PARTITION(gender='girl') VALUES (3,'Li Xia',20,DATE '2003-04-20') ;
INSERT INTO students PARTITION(gender='girl') VALUES (4,'Wang Lan',21,DATE '2002-01-08') ;
INSERT INTO students PARTITION(gender='boy') VALUES (5,'Wang Wu',17,DATE '2006-09-12') ;

Important

Avoid using the INSERT INTO statement to insert data because it may cause unexpected data duplication. We recommend that you use INSERT OVERWRITE instead. For more information, see Insert or overwrite data.

Query data
You can use the SELECT statement to perform operations such as nested queries, group queries, and sorting. For more information, see SELECT syntax. The following code provides a sample SQL statement:
```
-- Enable a full table scan. This setting is valid only for the current session.
SET odps.sql.allow.fullscan=true; 

-- Query the information of all male students and sort the results by ID in ascending order.
SELECT * FROM students WHERE gender='boy' ORDER BY id;
```
Note
By default, Resource Access Management (RAM) users do not have permissions to query data from production tables. To obtain these permissions, you must request them in Security Center. For more information about preset database permissions and access control for MaxCompute in DataWorks, see MaxCompute database permission control details. For more information about how to grant permissions using MaxCompute commands, see Manage user permissions using commands.

SQL functions

MaxCompute supports built-in functions and user-defined functions (UDFs) for data development and analysis. For more information about built-in functions, see Built-in functions overview. For more information about UDFs, see MaxCompute UDF overview. The following simple examples show how to use SQL functions.

Built-in functions: Built-in functions are predefined in MaxCompute and can be called directly. Based on the preceding example, you can use the dateadd function to modify the birth column by a specified unit and interval. The following code provides an example of the command:
```
-- Enable a full table scan. This setting is valid only for the current session.
SET odps.sql.allow.fullscan=true; 
SELECT id, name, age, birth, dateadd(birth,1,'mm') AS birth_dateadd FROM students;
```
User-defined functions (UDFs): To use a user-defined function (UDF), you must write the function code, upload the code as a resource, and then register the function. For more information, see Create and use a UDF.

Run the task and view the results

The run results are displayed in a workbook. You can perform operations in DataWorks, open the results in a workbook, or copy and paste the content into a local Excel file. For more information, see Task debugging process.
Note
Due to an adjustment in the China time zone information released by the International Organization for Standardization, a time difference may exist for certain periods when you run related SQL statements in DataWorks. The difference is 5 minutes and 52 seconds for dates from 1900 to 1928, and 9 seconds for dates before 1900.
Runtime log: You can click the Runtime Log tab to view the Logview. For more information, see Use Logview V2.0 to view job running information.

Results:

Query the information of all male students and sort the results by ID in ascending order.

+------------+------------+------------+------------+------------+
| id         | name       | age        | birth      | gender     |
+------------+------------+------------+------------+------------+
| 1          | Zhang San  | 15         | 2008-05-15 | boy        |
| 2          | Li Si      | 17         | 2006-07-20 | boy        |
| 5          | Wang Wu    | 17         | 2006-09-12 | boy        |
+------------+------------+------------+------------+------------+

Change the values in the birth column by a specified unit and interval.

+------------+------------+------------+------------+---------------+
| id         | name       | age        | birth      | birth_dateadd |
+------------+------------+------------+------------+---------------+
| 4          | Wang Lan   | 21         | 2002-01-08 | 2002-02-08    |
| 3          | Li Xia     | 20         | 2003-04-20 | 2003-05-20    |
| 2          | Li Si      | 17         | 2006-07-20 | 2006-08-20    |
| 1          | Zhang San  | 15         | 2008-05-15 | 2008-06-15    |
| 5          | Wang Wu    | 17         | 2006-09-12 | 2006-10-12    |
+------------+------------+------------+------------+---------------+

On the query results page, you can sort the results by a specified field and perform operations such as downloading the results. For more information, see Process query results.

Edit code: Advanced examples

The following topics provide more complex examples of ODPS SQL tasks:

FAQ

Q: Why does my ODPS SQL task take a long time to run and remain in the waiting state?
A: MaxCompute SQL tasks require quota resources to run. If your task is taking a long time to run, go to the MaxCompute console to check the quota resource consumption. Ensure that sufficient resources are available for the task to run. For more information, see Computing resources - Quota management.
Q: Why is the "You have No privilege 'odps:xxxx' on xxxx" error reported when I run an ODPS SQL task?
A: Ensure that the account used to run the ODPS SQL task has the required permissions for the corresponding MaxCompute project. For more information, see MaxCompute permissions.