Friday, July 30, 2010

IP To Country Lookup Table Using MySQL Spatial Index

I have an IP to Country lookup table that is used by various applications. The table is very simple:

CREATE TABLE `lu_ip_to_country` (
     id INT UNSIGNED NOT NULL auto_increment,
     start_ip_num int(10) UNSIGNED NOT NULL,
     end_ip_num int(10) UNSIGNED NOT NULL,
     country_id char(2) NOT NULL,
     iso_country_code varchar(3) DEFAULT NULL,
     country_name varchar(100) DEFAULT NULL,
     PRIMARY KEY (id),
     key start_end_ip (start_ip_num,end_ip_num)
);
  •  Start_ip_num: Beginning IP Number
  •  End_ip_num: Ending IP Number
  •  Country_id: two character country abbreviations (US, BG, FR, etc.) similar to ISO code
  •  ISO_country_code: ISO 3166 Country Code
  •  Country_name: Country Name

Beginning IP Number and Ending IP Number are the number version of the IPs which are calculated as follows:
       Ip number = 16777216*w + 65536*x + 256*y + z
Where w, x, y, z represent the IP address (w, x,y,z). For example: if ip is 61.94.7.255,then w=61,x=94,y=7 and z=255.

 The queries that use to lookup country information are simple as well:
      Select country_id from lu_ip_to_country
      Where v_ip_num_lookup between start_ip_num and end_ip_num

In this query, the v_ip_num_lookup is a variable used to store the calculated ip number. For example, '61.94.7.255’=1029572607.

The other query utilized the MySQL INET_ATON() function, which convert the ip to ip number. i.e. INET_ATON('61.94.7.255') = 1029572607. The query is the following:

   Select country_id from lu_ip_to_country
   Where INET_ATON('61.94.7.255') between start_ip_num and end_ip_num

These tow queries worked fine but slow. It will take up-to 1.5 second to return the result. And they were even slower when there were concurrent requests to this table. I had tried different indexes, and even tried to use memory table, but the improvements were not notable.

The other week, while I was searching the web, I came across an article, which was written by Jeremy Cole, described how to use MySQL GIS with spatial R-tree indexes for similar ip-to-country lookup in MySQL.I decided to give it a try. And, it turn out that the result was excellence.

Jeremy Cole’s article gave a vary detail explanation on how GIS with spatial R-tree indexes work in this case. Here are the steps on how I implemented and tested it (following what were described in the article).

1. Reconstruct the table to using POLYGON type and spatial index

CREATE TABLE lu_ip_to_country_geoip (
     id INT UNSIGNED NOT NULL auto_increment,
     ip_poly POLYGON NOT NULL,
     start_ip_num int(10) UNSIGNED NOT NULL,
     end_ip_num int(10) UNSIGNED NOT NULL,
     country_id varchar(3) NOT NULL,
     iso_country_code varchchar(3) DEFAULT NULL,
    country_name varchar(100) DEFAULT NULL,
    PRIMARY KEY (id),
    SPATIAL INDEX (ip_poly)
);

2. Export the data from the existing table
     Select start_ip_num,
                end_ip_num,
                country_id,
               iso_country_code,
              country_name
             Into outfile ‘/tmp/ip-country-data.dat’
     FIELDS
            TERMINATED BY ","
              ENCLOSED BY "\""
     LINES
           TERMINATED BY "\n"
      FROM lu_ip_to_country;

If you don’t have the data in the first place, you can download them from MaxMind web site (http://www.maxmind.com/app/country) as mentioned in the article.

3. Load the data into the new table with the MySQL GIS function to build the POLYGON for ip_poly field from the start_ip_num and end_ip_num fields.

    LOAD DATA LOCAL INFILE "/tmp/ip-country-data.dat"
         INTO TABLE lu_ip_to_country_geoip
         FIELDS
            TERMINATED BY ","
             ENCLOSED BY "\""
         LINES
              TERMINATED BY "\n"
           (
               @start_ip_num, @end_ip_num,
                @country_id,@iso_country_code, @country_name
             )
       SET
            id := NULL,
            start_ip_num := @start_ip_num,
           end_ip_num := @end_ip_num,
           ip_poly := GEOMFROMWKB(POLYGON(LINESTRING(
          /* clockwise, 4 points and back to 0 */
          POINT(@start_ip_num, -1), /* 0, top left */
          POINT(@end_ip_num, -1), /* 1, top right */
          POINT(@end_ip_num, 1), /* 2, bottom right */
          POINT(@start_ip_num, 1), /* 3, bottom left */
         POINT(@start_ip_num, -1) /* 0, back to start */))),
       country_id := @country_code,
       iso_country_code := @iso-country_code,
       country_name := @country_string;

4. Test queries

mysql> SELECT country_id,iso_country_code,country_name FROM lu_ip_to_country WHERE INET_ATON('61.94.7.255') BETWEEN start_ip_address AND end_ip_address;
+------------+------------------+--------------+
| country_id | iso_country_code | country_name |
+------------+------------------+--------------+
| ID | IDN | Indonesia |
+------------+------------------+--------------+
1 row in set (1.57 sec)

mysql> SELECT country_id,iso_country_code,country_name FROM lu_ip_to_country_geoip WHERE MBRCONTAINS(ip_poly, POINTFROMWKB(POINT(INET_ATON('61.94.7.255'), 0)));
+------------+------------------+--------------+
| country_id | iso_country_code | country_name |
+------------+------------------+--------------+
| ID | IDN | Indonesia |
+------------+------------------+--------------+
1 row in set (0.35 sec)

5. Perform A/B Test on ETL process.

I also selected a reasonable data file and ran through the process using old ip-to-country table for process A and new ip-to-country table for process B and QA’edfor the following:

  •  Check to make sure all Country IDs/Codes match exactly on both out put
  • Compare times
 Both processes produced the same country code lookups except the new table made the same process run much faster.

References:


No comments: