#AWS | #NLP | #Serverless | #blog

May 9, 2018

Using NLTK library with AWS Lambda

This is a walkthrough of creating a simple serverless app that finds the part-of-speech tags of an input text.

1 Create virtual environment

To isolate this app's dependencies from system-wide packages, create a separate virtual environment (mkvirtualenv comes from virtualenvwrapper):

~ mkvirtualenv nltk_env

2 Install nltk

In the virtual environment, use pip to install the nltk package:

(nltk_env) ~ pip install nltk

3 Download nltk data

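Pip doesn’t install the additional data files that the app needs, but nltk has helper functions to download them. The session below downloads tagsets; the handler in step 6 also calls word_tokenize and pos_tag, which need their own data packages (punkt and averaged_perceptron_tagger in nltk releases of this period), downloaded the same way: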

(nltk_env) ~ python 
Python 3.6.2 (v3.6.2:5fd33b5926, Jul 16 2017, 20:11:06) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download('tagsets')
[nltk_data] Downloading package tagsets to /Users/as/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
True
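
The download can also be scripted. Below is a minimal sketch, assuming that nltk.download accepts a download_dir argument (it does in the releases I am aware of) and that the package names beyond tagsets match your nltk version. The helper name download_nltk_data and the downloader parameter are hypothetical, introduced here only so the function can be exercised without network access.

```python
from pathlib import Path

# Packages the handler in this post relies on. Only "tagsets" appears in the
# original session; the other two names are an assumption that matches nltk
# releases from around 2018 and may differ in newer versions.
REQUIRED_PACKAGES = ["tagsets", "punkt", "averaged_perceptron_tagger"]


def download_nltk_data(target_dir, packages=REQUIRED_PACKAGES, downloader=None):
    """Download each package into target_dir and report success per package."""
    if downloader is None:
        import nltk  # imported lazily so the helper can be tested without nltk
        downloader = nltk.download
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    results = {}
    for name in packages:
        # nltk.download returns True on success and False otherwise
        results[name] = bool(downloader(name, download_dir=str(target)))
    return results
```

Calling download_nltk_data("./") from the app folder would place the data next to the code directly, which may make the copy in the next step unnecessary.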

4 Copy downloaded nltk data to current directory

The helper functions download the extra data to the user's home directory, so you need to copy it next to the app code:

(nltk_env) ~ cp -R /Users/as/nltk_data/* ./

5 Copy site packages from virtualenv directory

Now copy all the packages from the site-packages folder of the virtual environment to the folder with the app:

(nltk_env) ~ cp -R /Users/as/.virtualenvs/nltk_env/lib/python3.6/site-packages/* ./

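To find the site-packages folder, run which python: it prints the path of the environment's interpreter, and site-packages sits under the same environment's lib/python3.6 directory.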
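
As an alternative to reading the path off which python, the interpreter can report the folder itself. This is a small sketch using the standard sysconfig module; the function name site_packages_dir is mine, not from the original post. Run it with the virtual environment active.

```python
import sysconfig


def site_packages_dir():
    """Return the directory where pip installs pure-Python packages
    for the interpreter that is running this code."""
    return sysconfig.get_paths()["purelib"]


print(site_packages_dir())
```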

6 Create the Lambda function code

Save the following as lambda_function.py, so that it matches the --handler value used in step 10:

import imp
import sys
sys.modules["sqlite"] = imp.new_module("sqlite") # (1)
sys.modules["sqlite3.dbapi2"] = imp.new_module("sqlite.dbapi2")

import nltk

from nltk.data import load
tagdict = load('help/tagsets/upenn_tagset.pickle')

def lambda_handler(event, context):
    text = event.get('text')
    tokenized = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokenized)
    return {word: tagdict[tag][0] for word, tag in tagged}

(1) Since libsqlite3-dev is not installed in the container running Lambda, this workaround of registering empty dummy modules is needed.
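
The imp module used above is deprecated and was removed in Python 3.12, so on newer runtimes the same trick needs types.ModuleType instead. The sketch below is one possible variant, not the original author's code: the helper stub_if_missing is a hypothetical name, and unlike the original it only installs a placeholder when the real module fails to import.

```python
import importlib
import sys
import types


def stub_if_missing(name):
    """Register an empty placeholder module under `name` if it cannot be imported.

    Returns True when a stub was installed, False when the real module loaded.
    """
    try:
        importlib.import_module(name)
        return False
    except ImportError:
        sys.modules[name] = types.ModuleType(name)
        return True


# Same module names as in the handler above
for module_name in ("sqlite", "sqlite3.dbapi2"):
    stub_if_missing(module_name)
```

Checking first means that an environment where sqlite3 works keeps the real module, while the Lambda container gets the dummy.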

7 Check the size

Lambda limits the size of the deployment package (AWS documents 50 MB for a zipped direct upload and 250 MB unzipped), so check the folder size with:

(nltk_env) ~ du -sh ./ | cut -f1
187M
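
The same check can be done from Python. This is a rough sketch, assuming the 250 MB unzipped limit that AWS documents for deployment packages; the function names are illustrative. It sums file sizes, so the figure can differ slightly from du, which reports disk usage.

```python
import os

# Assumed limit: 250 MB for the unzipped deployment package
UNZIPPED_LIMIT_BYTES = 250 * 1024 * 1024


def directory_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if not os.path.islink(file_path):  # skip symlinks
                total += os.path.getsize(file_path)
    return total


def fits_lambda_limit(path, limit=UNZIPPED_LIMIT_BYTES):
    """Return True when the folder is within the given size limit."""
    return directory_size(path) <= limit
```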

8 Zip everything

To deploy the Lambda, zip the folder:

(nltk_env) ~ zip -r -9 -q ./lambda.zip *
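
For a build script, the archive can also be produced with the standard zipfile module. This is a sketch under a few assumptions: Python 3.7 or later on the build machine (for the compresslevel argument), and a hypothetical helper name zip_directory. Files are stored relative to the folder root, which is where Lambda expects lambda_function.py.

```python
import os
import zipfile


def zip_directory(source_dir, archive_path):
    """Zip the contents of source_dir so that its files sit at the archive root."""
    archive_abs = os.path.abspath(archive_path)
    with zipfile.ZipFile(archive_abs, "w", zipfile.ZIP_DEFLATED,
                         compresslevel=9) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                file_path = os.path.join(root, name)
                if os.path.abspath(file_path) == archive_abs:
                    continue  # do not add the archive to itself
                archive.write(file_path, os.path.relpath(file_path, source_dir))
    return archive_abs
```

Unlike the shell glob above, os.walk also picks up hidden files, so the two archives can differ if the folder contains dotfiles.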

9 Upload to S3

Upload the zipped Lambda code to S3, from where it will be deployed:

(nltk_env) ~ aws s3 mb s3://serverless-nltk
(nltk_env) ~ aws s3 cp ./lambda.zip s3://serverless-nltk

10 Create lambda

Use the AWS CLI to create the Lambda function and tell it where on S3 the code resides:

(nltk_env) ~ aws lambda create-function \
                    --function-name serverless-nltk \
                    --runtime python3.6 \
                    --role arn:aws:iam::1234567890:role/lambda_basic_execution \
                    --handler lambda_function.lambda_handler \
                    --code S3Bucket=serverless-nltk,S3Key=lambda.zip \
                    --environment Variables={NLTK_DATA=./}

Key things here are:

the role ARN can be found in IAM (look for the role named lambda_basic_execution)
the environment variable NLTK_DATA tells nltk where to look for data
the S3Key value must match the name of the file uploaded in step 9
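
For those who prefer boto3 over the CLI, the same settings map onto the keyword arguments of the Lambda client's create_function call. The sketch below assumes boto3 is installed and credentials are configured; build_create_function_params and create_function are hypothetical helper names, and the account number in the usage line is the placeholder from this post.

```python
def build_create_function_params(function_name, role_arn, bucket, key,
                                 handler="lambda_function.lambda_handler",
                                 runtime="python3.6", nltk_data="./"):
    """Assemble the keyword arguments for the Lambda create_function call."""
    return {
        "FunctionName": function_name,
        "Runtime": runtime,
        "Role": role_arn,
        "Handler": handler,
        # Where the zipped code lives on S3
        "Code": {"S3Bucket": bucket, "S3Key": key},
        # Tells nltk where to look for its data files
        "Environment": {"Variables": {"NLTK_DATA": nltk_data}},
    }


def create_function(params):
    """Create the function; needs boto3 and AWS credentials (assumption)."""
    import boto3  # imported lazily so the builder works without boto3
    return boto3.client("lambda").create_function(**params)


params = build_create_function_params(
    "serverless-nltk",
    "arn:aws:iam::1234567890:role/lambda_basic_execution",
    bucket="serverless-nltk",
    key="lambda.zip",
)
```

Keeping the parameter building separate from the API call makes it easy to inspect or log the settings before anything is created.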

Now let’s create a simple JavaScript application that calls the Lambda with user input from the page:

Go to AWS Cognito
Create a new identity pool
In the first step, check Enable access to unauthenticated identities
In the sample code step, select JavaScript and copy the IdentityPoolId (needed in the invocation script later)
Go to IAM
Find the role for unauthenticated access (it will look like Cognito_serverless_nltkUnauth_Role)
Select Permissions and edit the role's policy as JSON. It should look like this:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "mobileanalytics:PutEvents",
                "cognito-sync:*"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": [
                "arn:aws:lambda:us-east-1:1234567890:function:serverless-nltk"
            ]
        }
    ]
}
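
Since the only part of this policy that changes between setups is the function ARN, it can be generated rather than edited by hand. A minimal sketch, with hypothetical helper names and the placeholder account number from this post:

```python
import json


def lambda_arn(region, account_id, function_name):
    """Build the ARN of a Lambda function from its parts."""
    return f"arn:aws:lambda:{region}:{account_id}:function:{function_name}"


def unauth_role_policy(function_arn):
    """Return the policy document shown above for the given function ARN."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["mobileanalytics:PutEvents", "cognito-sync:*"],
                "Resource": ["*"],
            },
            {
                "Effect": "Allow",
                "Action": ["lambda:InvokeFunction"],
                "Resource": [function_arn],
            },
        ],
    }


arn = lambda_arn("us-east-1", "1234567890", "serverless-nltk")
print(json.dumps(unauth_role_policy(arn), indent=4))
```

Limiting lambda:InvokeFunction to a single ARN, as the original policy does, keeps unauthenticated visitors from invoking any other function in the account.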

The script calling the Lambda will look like this:

<script type="text/javascript">
    var button = document.getElementById('upload-button');
    AWS.config.credentials = new AWS.CognitoIdentityCredentials({IdentityPoolId: 'us-east-1:8b6a0b3d-6a2a-4c7d-b617-c8dafd8a1aec'});
    AWS.config.region = 'us-east-1';
    var lambda = new AWS.Lambda({region: 'us-east-1', apiVersion: '2015-03-31'});
    
    function htmlToElement(html) {
        var template = document.createElement('template');
        html = html.trim(); // Never return a text node of whitespace as the result
        template.innerHTML = html;
        return template.content.firstChild;
    }
    
    function call_lambda() {
        var pullParams = {
            FunctionName : 'serverless-nltk',
            InvocationType : 'RequestResponse',
            LogType : 'None',
            Payload : JSON.stringify({text: document.getElementById("exampleFormControlTextarea1").value})
        };
        // Holds the data returned by the Lambda function
        var pullResults;
        lambda.invoke(pullParams, function(error, data) {
            if (error) {
                console.log(error);
            } else {
                pullResults = JSON.parse(data.Payload);
                console.log(pullResults);
                var result = document.getElementById("result");
                result.innerHTML = '';
                // Render one "word: tag description" line per entry
                for (var key in pullResults) {
                    var text = htmlToElement('<span>' + key + ':&nbsp;</span>');
                    var pos = htmlToElement('<span>' + pullResults[key] + '</span>');
                    var line = htmlToElement('<h6></h6>');
                    line.appendChild(text);
                    line.appendChild(pos);
                    result.appendChild(line);
                }
            }
        });
    }
</script>
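
To try the function without a browser, the same request can be sent from Python. This is a sketch, assuming boto3 is installed and the caller's AWS credentials are allowed to invoke the function; the helper names are hypothetical. The payload has the same shape as the one the script above sends.

```python
import json


def build_payload(text):
    """Encode the event the handler expects: a JSON object with a "text" key."""
    return json.dumps({"text": text}).encode("utf-8")


def parse_result(raw_payload):
    """Decode the JSON body returned by the Lambda into a dict."""
    return json.loads(raw_payload.decode("utf-8"))


def invoke_tagger(text, function_name="serverless-nltk", region="us-east-1"):
    """Invoke the function synchronously; needs boto3 and credentials (assumption)."""
    import boto3  # imported lazily so the helpers above work without boto3
    client = boto3.client("lambda", region_name=region)
    response = client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",
        Payload=build_payload(text),
    )
    # The response body is a stream; read it fully before decoding
    return parse_result(response["Payload"].read())
```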