MaxCompute cannot directly call external API operations. This tutorial provides a complete case study that shows you how to use an offline IP database and a user-defined function (UDF) to efficiently analyze IP address geolocations in MaxCompute.
Background
MaxCompute is a closed computing environment and cannot directly call external API operations via HTTP, such as the Taobao IP database API, to query IP geolocations in real time. The following three solutions can be used to analyze the geolocation of IP addresses in MaxCompute.
Solution | Implementation | Pros | Cons | Conclusion |
Export data for query | Export the IP addresses to be analyzed from MaxCompute to a local device, and then use a script to cyclically call API operations for queries. | Simple | Low efficiency. Subject to API rate limits (such as 10 QPS), making it unsuitable for large-scale data. | Not recommended |
Query a local database | Download both the IP database and the IP addresses to be analyzed to a local device, and then perform matching using a local program. | No API rate limits | The data is separated from the data warehouse and cannot be associated with other business data (such as order tables) in MaxCompute for analysis. | Not recommended |
Import the IP database for query | Upload the entire IP address database to a MaxCompute table, and then use a UDF in SQL to perform queries. |
| Requires regular manual updates to the IP database. | Recommended |
Conclusion: The third solution is the only one that can meet the demands for high-performance and highly integrated analysis in big data use cases. This tutorial details how to implement this solution.
Solution overview
This solution consists of three main modules:
Data preparation: Upload a public IP address database file to a MaxCompute table as a reference.
UDF development: Create a UDF to convert string IP addresses into integers. This is essential for efficient range matching.
SQL analysis: Write SQL statements that call the UDF to look up IP addresses in the MaxCompute table and obtain their geolocations.
Preparations
Create a workspace that has not opted in to the Data Studio public preview. This tutorial uses a DataWorks workspace in Basic mode.
Bind a serverless resource group to the workspace.
Bind a MaxCompute computing resource to the workspace and test its network connectivity.
In the target workspace, create a business flow.
Procedure
Prepare an IP address database
Sample database: ipdata.csv.
Sample data structure:
The data format is
UTF-8
.The first four data fields are the start and end addresses of the IP range. The first two are in decimal integer format, and the next two are in dotted-decimal format.
Upload the IP address database to MaxCompute
Create a table.
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
On the DataStudio page, right-click the target business flow and choose
.In the dialog box, select the computing resource you prepared for Engine Instance, enter a name, and click Confirm.
On the node editor page, enter the following code:
-- If a table named ipresource already exists, delete it first. DROP TABLE IF EXISTS ipresource; -- Create a table to store IP addresses. CREATE TABLE IF NOT EXISTS ipresource ( start_ip BIGINT, -- Start address of the IP range (in decimal integer format). end_ip BIGINT, -- End address of the IP range (in decimal integer format). start_ip_arg STRING, -- Start address of the IP range (in dotted-decimal string format, such as "1.0.1.x"). end_ip_arg STRING, -- End address of the IP range (in dotted-decimal string format, such as "1.0.3.x"). country STRING, -- Country name. area STRING, -- Area name (usually empty or same as the country). city STRING, -- City name (may be a province for some data). county STRING, -- District/county name. isp STRING -- Internet service provider (such as China Telecom, China Unicom, and China Mobile). );
Click the
icon to run the code. In the Parameters dialog box, select the serverless resource group you bound and click Run. Wait for Estimate MaxCompute Computing Cost to complete, then click Run.
Upload data to the table.
Click the
icon in the upper-left corner. In the displayed page, click .
Click the
icon in the left navigation pane.
Click Upload Data. Configure the parameters as described in the following table.
Parameter
Description
Specify Data to Be Uploaded
Select File
Upload the
ipdata.csv
file.Configure Destination Table
Compute Engine
MaxCompute
MaxComputeProject Name
Select the taget MaxCompute project.
Destination Table
Select the ipresource table.
ImportantIf the newly created table is not in the list, go to
and manually refresh the metadata forodps.<project_name>.ipresource
.Resource Group
Select the serverless resource group you prepared.
Preview Data of Uploaded File
Click Mapping by Order to map the data in the .csv file to the fields of the ipresource table.
Click Upload Data.
Verify that the data is uploaded.
Return to the ODPS SQL node editor on the DataStudio page. Execute the following SQL query. If you can see the number of data rows and sample data, the upload was successful.
-- Query the number of data rows in the table. SELECT COUNT(*) FROM ipresource; -- Preview the first 10 rows of data. SELECT * FROM ipresource limit 10;
Develop a UDF
To look up the geolocation of an IP address in SQL, you need a UDF that converts a dotted-decimal IP string (such as '1.0.2xx.4x'
) into a comparable decimal integer.
Create a MaxCompute Python resource.
On the DataStudio page, right-click the target business flow and choose
.In the dialog box, enter a name, such as
mc.py
. Select Upload to MaxCompute and click Create.On the node editor page, enter the following code:
from functools import reduce from odps.udf import annotate @annotate("string->bigint") class IPtoInt(object): def evaluate(self, ip): try: return reduce(lambda x, y: (x << 8) + y, map(int, ip.split('.'))) except: return 0
Click the
icon.
Create a MaxCompute UDF.
On the DataStudio page, right-click the target business flow and choose
.In the dialog box, enter a name, such as
ip2int
, and click Create. This will be the UDF name and cannot be changed after creation.On the editor page, configure the parameters. Key parameters are described below:
For more parameter descriptions, see Register a MaxCompute function.
Parameter
Description
Function Type
Keep the default setting Other Function.
Class Name
The main class that implements the UDF. Example value:
mc.IPtoInt
.NoteWhen the resource type is Python, the class name format is Python resource name.Class name (the .py in the resource name is not required).
Resources
Select the resource file
mc.py
from the previous step.Click the
icon to save the function.
Click the
icon. In the dialog box, enter the change description and click Confirm.
Find IP geolocation by using SQL
Return to the ODPS SQL node editor. Execute the following SQL query:
-- Replace this with the actual IP address you want to query. SELECT * FROM ipresource WHERE ip2int('1.0.2xx.4x') >= start_ip AND ip2int('1.0.2xx.4x') <= end_ip;
Click the
icon to run the code. In the Parameters dialog box, select the serverless resource group you bound and click Run. Wait for Estimate MaxCompute Computing Cost to complete, then click Run.
View the execution result. You can find the location of the IP address.