Wednesday, April 22, 2015

Access AWS DynamoDB Using Presto SQL Like Query

Presto is a good tool from Facebook to query multiple data sources. It supports multiple plugins (called "Connector") to talk to the physical data source like database or Apache Cassandra. But it lacks of support for AWS Dynamo DB.

I am developing a plugin for DynamoDB Connector. It is still in very early stage, and I hope this blog can get more people know about it and join the development work together with me.

Steps to use the DynamoDB Connector.

1. Download Source Code for DynamoDB Connector and Build the code

Go to my github https://github.com/Zutai/presto. It contains the full source code of Presto, and a new presto-dynamo child directory which contains my code for DynamoDB.

Switch to branch "presto_0.100". This is a MUST for now, since my code is based on Presto 0.100 and is only buildable with Presto 0.100. Thus I created that branch. The master branch in my repository is not buildable in current stage.

Download the source code from presto_0.100 brach. Go to presto-dynamo directory. Run following command to build:
  mvn compile package -DskipTests

After build, you will see file "presto-dynamo-0.100.zip" in the target directory. Unzip the zip file, and you will see multiple jar files inside. These jar files are all needed.

2. Deploy Dynamo Connector to Presto and Configure DynamoDB Metadata File

In Presto server, create a new directory "dynamo" under the plugin directory. Copy all the jar files inside the zip file in previous step to the "dynamo" directory.

Create a catalog properties file for Dynamo under director: etc/catalog/dynamo.properties with following content:

connector.name=dynamo
dynamo.metadata-file=/home/YourUser/presto-dynamo-metadata.json

The metadata file "/home/YourUser/presto-dynamo-metadata.json" will contain the metadata for your DynamoDB, for example, table name, column names and types.

Following is an example of presto-dynamo-metadata.json:

{
    "tables": [
        {
            "columns": [
                {
                    "columnName": "UserId",
                    "columnType": "STRING",
                    "typeArguments": []
                },
                {
                    "columnName": "Version",
                    "columnType": "LONG",
                    "typeArguments": []
                }
            ],
            "region": "us_west_2",
            "tableName": "Users"
        },
        {
            "columns": [
                {
                    "columnName": "BookName",
                    "columnType": "STRING",
                    "typeArguments": []
                },
                {
                    "columnName": "Writers",
                    "columnType": "LIST",
                    "typeArguments": [
                        "STRING"
                    ]
                }
            ],
            "region": "us_west_2",
            "tableName": "Books"
        }
    ]
}

Now start Presto server, for example:

  [PrestoDirectory]/bin/launcher run

3. Run Presto Command Line to Query DynamoDB

Download  Follow this link to download the command line tool (presto-cli-0.100-executable.jar), rename it to presto and make it executable (chmod +x).

Then run following command to start it with dynamo catalog and us_west_2 as schema.

./presto --server localhost:8080 --catalog dynamo --schema us_west_2

The AWS region is mapped to schema. Currently only us_west_2 is supported.

Then you can run query. For example, I previously manually created Users table in AWS DynamoDB, so I can run following command to query the table:

  select * from Users;

You will see result like following:

presto:us_west_2> select * from users;
  UserId  | Version
----------+---------
 001       | NULL    
(1 rows)


Note: current Dynamo Connector implementation always does full table scan. More work needs to be done to enable query by key or index.

[To Be Added More Content]