DataWorks provides the ODPS SQL node, which lets you periodically schedule MaxCompute SQL tasks. You can also integrate these tasks with other node types for combined scheduling. MaxCompute SQL tasks use an SQL-like syntax and are ideal for the distributed processing of large datasets (in terabytes) that do not require real-time results. This topic describes important considerations and provides guidance for developing SQL tasks in DataWorks.
About the node
ODPS SQL is used to process and query data in MaxCompute. It supports common SQL operations, such as SELECT, INSERT, UPDATE, and DELETE, along with MaxCompute-specific syntax and functions. ODPS SQL lets you write SQL-like statements to query and process data without having to write complex data processing logic. For more information about the SQL syntax, see SQL overview.
Limits
The following table describes the limits on developing ODPS SQL nodes in DataWorks.
Classification | Description |
Comment usage | Supports single-line comments that start with For more information, see MaxCompute SQL comments. The following limits also apply to comments.
|
SQL submission | ODPS SQL does not support running SET or USE statements alone. They must be executed with another SQL statement. |
SQL development | The SQL code size cannot exceed 128 KB. The number of SQL commands cannot exceed 200. |
Query results | Only SQL statements that start with SELECT or WITH can output a formatted result set. Query results have the following limits:
Note If you encounter these limits, download the query results to a local device for viewing using one of the following methods:
|
Notes
Ensure that the account used to run the ODPS SQL task has the required permissions for the corresponding MaxCompute project. For more information, see DataWorks on MaxCompute permission control and MaxCompute permissions.
MaxCompute SQL tasks require quota resources to run. If your task is taking a long time to run, go to the MaxCompute console to check the quota resource consumption. Ensure that sufficient resources are available for the task to run. For more information, see Computing resources - Quota management.
When you develop an ODPS SQL node task, special parameters, such as an OSS address, must be enclosed in double quotation marks (""). If you do not enclose them in quotation marks, the task may fail to parse, which causes the task to fail.
In some extreme cases, such as an unexpected server power outage or a primary/secondary failover, DataWorks may not be able to completely stop the related MaxCompute task processes. If this occurs, go to the corresponding project in MaxCompute Computing Resources to stop the job.
The execution order of statements that contain keywords, such as SET and USE, varies depending on the environment in DataWorks.
In DataStudio: All statements in the current task code that contain keywords such as SET and USE are combined. They are then executed before all other SQL statements.
In the scheduling environment: The statements are executed in the order they are written.
Assume that the code in the node is defined as follows.
SET a=b; CREATE TABLE name1(id string); SET c=d; CREATE TABLE name2(id string);
The execution order in different environments is as follows:
Executed SQL
DataStudio
Scheduling O&M
First SQL statement
SET a=b; SET c=d; CREATE TABLE name1(id string);
SET a=b; CREATE TABLE name1(id string);
Second SQL statement
SET a=b; SET c=d; CREATE TABLE name2(id string);
SET c=d; CREATE TABLE name2(id string);
Edit code: Simple examples
SQL commands
MaxCompute SQL commands use a syntax similar to standard SQL. They support DDL, DML, and DQL statements, along with commands with MaxCompute-specific syntax. For detailed syntax requirements and examples of each SQL command, see SQL overview. The following simple examples show how to develop and run SQL commands.
When a MaxCompute V2.0 extension function uses a new data type, you must add the
SET odps.sql.type.system.odps2=true;
command before the SQL statement of the function. Then, you must submit this command together with the SQL statement to use the new data type. For more information about V2.0 data types, see Data types (V2.0).DataWorks provides scheduling parameters to allow for dynamic parameter passing in scheduling scenarios. In an ODPS SQL node, you can define variables in the code using the
${variable_name}
format. Then, on the Schedule tab, you can assign a value to the variable in the Parameters section. For more information about the supported formats for scheduling parameters, see Supported formats of scheduling parameters.
Create a table
You can use the
CREATE TABLE
statement to create non-partitioned tables, partitioned tables, foreign tables, and clustered tables. For more information, see CREATE TABLE. The following code provides a sample SQL statement:-- Create a partitioned table named students. CREATE TABLE if NOT EXISTS students ( id BIGINT, name STRING, age BIGINT, birth DATE) partitioned BY (gender STRING);
Insert data
You can use the
INSERT INTO
orINSERT OVERWRITE
statement to insert or update data in a destination table. For more information, see Insert or overwrite data (INSERT INTO | INSERT OVERWRITE). The following code provides a sample SQL statement:-- Insert data. INSERT INTO students PARTITION(gender='boy') VALUES (1,'Zhang San',15,DATE '2008-05-15') ; INSERT INTO students PARTITION(gender='boy') VALUES (2,'Li Si',17,DATE '2006-07-20') ; INSERT INTO students PARTITION(gender='girl') VALUES (3,'Li Xia',20,DATE '2003-04-20') ; INSERT INTO students PARTITION(gender='girl') VALUES (4,'Wang Lan',21,DATE '2002-01-08') ; INSERT INTO students PARTITION(gender='boy') VALUES (5,'Wang Wu',17,DATE '2006-09-12') ;
ImportantAvoid using the
INSERT INTO
statement to insert data because it may cause unexpected data duplication. We recommend that you useINSERT OVERWRITE
instead. For more information, see Insert or overwrite data.Query data
You can use the
SELECT
statement to perform operations such as nested queries, group queries, and sorting. For more information, see SELECT syntax. The following code provides a sample SQL statement:-- Enable a full table scan. This setting is valid only for the current session. SET odps.sql.allow.fullscan=true; -- Query the information of all male students and sort the results by ID in ascending order. SELECT * FROM students WHERE gender='boy' ORDER BY id;
NoteBy default, Resource Access Management (RAM) users do not have permissions to query data from production tables. To obtain these permissions, you must request them in Security Center. For more information about preset database permissions and access control for MaxCompute in DataWorks, see MaxCompute database permission control details. For more information about how to grant permissions using MaxCompute commands, see Manage user permissions using commands.
SQL functions
MaxCompute supports built-in functions and user-defined functions (UDFs) for data development and analysis. For more information about built-in functions, see Built-in functions overview. For more information about UDFs, see MaxCompute UDF overview. The following simple examples show how to use SQL functions.
Built-in functions: Built-in functions are predefined in MaxCompute and can be called directly. Based on the preceding example, you can use the
dateadd
function to modify thebirth
column by a specified unit and interval. The following code provides an example of the command:-- Enable a full table scan. This setting is valid only for the current session. SET odps.sql.allow.fullscan=true; SELECT id, name, age, birth, dateadd(birth,1,'mm') AS birth_dateadd FROM students;
User-defined functions (UDFs): To use a user-defined function (UDF), you must write the function code, upload the code as a resource, and then register the function. For more information, see Create and use a UDF.
Run the task and view the results
The run results are displayed in a workbook. You can perform operations in DataWorks, open the results in a workbook, or copy and paste the content into a local Excel file. For more information, see Task debugging process.
NoteDue to an adjustment in the China time zone information released by the International Organization for Standardization, a time difference may exist for certain periods when you run related SQL statements in DataWorks. The difference is 5 minutes and 52 seconds for dates from 1900 to 1928, and 9 seconds for dates before 1900.
Runtime log: You can click the Runtime Log tab to view the Logview. For more information, see Use Logview V2.0 to view job running information.
Results:
Query the information of all male students and sort the results by ID in ascending order.
+------------+------------+------------+------------+------------+ | id | name | age | birth | gender | +------------+------------+------------+------------+------------+ | 1 | Zhang San | 15 | 2008-05-15 | boy | | 2 | Li Si | 17 | 2006-07-20 | boy | | 5 | Wang Wu | 17 | 2006-09-12 | boy | +------------+------------+------------+------------+------------+
Change the values in the birth column by a specified unit and interval.
+------------+------------+------------+------------+---------------+ | id | name | age | birth | birth_dateadd | +------------+------------+------------+------------+---------------+ | 4 | Wang Lan | 21 | 2002-01-08 | 2002-02-08 | | 3 | Li Xia | 20 | 2003-04-20 | 2003-05-20 | | 2 | Li Si | 17 | 2006-07-20 | 2006-08-20 | | 1 | Zhang San | 15 | 2008-05-15 | 2008-06-15 | | 5 | Wang Wu | 17 | 2006-09-12 | 2006-10-12 | +------------+------------+------------+------------+---------------+
On the query results page, you can sort the results by a specified field and perform operations such as downloading the results. For more information, see Process query results.
Edit code: Advanced examples
The following topics provide more complex examples of ODPS SQL tasks:
FAQ
Q: Why does my ODPS SQL task take a long time to run and remain in the waiting state?
A: MaxCompute SQL tasks require quota resources to run. If your task is taking a long time to run, go to the MaxCompute console to check the quota resource consumption. Ensure that sufficient resources are available for the task to run. For more information, see Computing resources - Quota management.
Q: Why is the "You have No privilege 'odps:xxxx' on xxxx" error reported when I run an ODPS SQL task?
A: Ensure that the account used to run the ODPS SQL task has the required permissions for the corresponding MaxCompute project. For more information, see MaxCompute permissions.