Glue Development using Jupyter
Developing and testing Glue jobs in the VS Code IDE is one of the best development options, because plain Jupyter doesn’t offer IDE features. In this blog, I set up a Glue Docker container on an EC2 instance and use the VS Code Jupyter notebook feature to develop Glue jobs. If you want to create your own, more customized Docker image, please see AWS Glue run locally.
- Docker-based environment
- Using Docker compose
- Notebook for JDBC access
- Using pyathena
- Create Development env
Docker-based environment
First, you have to install Docker on Amazon Linux as explained in the install instructions.
Then start the Docker service:
sudo service docker start
Create a Docker container based on the Glue 3.0 image (AWS has released Glue version 4.0 as well):
docker run -it -v ~/.aws:/home/glue_user/.aws -v "$(pwd)":/home/glue_user/workspace/jupyter_workspace/ -e AWS_PROFILE=$PROFILE_NAME -e DISABLE_SSL=true --rm -p 4040:4040 -p 8888:8888 --name glue_pyspark amazon/aws-glue-libs:glue_libs_3.0.0_image_01
This will take you to the container bash shell.
At the container bash prompt, you can optionally install any libraries you need before using a notebook. For example, install pyathena to access AWS Athena:
pip3 install pyathena
The Livy server must be running before you start the notebook. Run the following command at the container bash prompt:
livy-server start
Then start JupyterLab:
jupyter lab --no-browser --ip=0.0.0.0 --allow-root --ServerApp.root_dir=/home/glue_user/workspace/jupyter_workspace/ --ServerApp.token='pyspark' --ServerApp.password=''
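To attach the VS Code Jupyter extension to this server, you need the server URL including the token set by --ServerApp.token. The following is a minimal sketch that builds that URL; the function name jupyter_url and the host value are hypothetical, while the port 8888 and the token 'pyspark' come from the commands above.

```python
from urllib.parse import urlencode, urlunsplit


def jupyter_url(host: str, port: int = 8888, token: str = "pyspark") -> str:
    """Build the URL of the JupyterLab server started in the container.

    host is an assumption: use localhost when port 8888 is forwarded over
    SSH, or the address of the EC2 instance otherwise.
    """
    query = urlencode({"token": token})
    return urlunsplit(("http", f"{host}:{port}", "/", query, ""))


if __name__ == "__main__":
    # Paste the printed URL into VS Code when asked for an existing Jupyter server.
    print(jupyter_url("localhost"))
```

Plain http is used here because the server above is started without TLS.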
Using Docker compose
Instead of the above, if you want more control, create a Dockerfile:
FROM public.ecr.aws/glue/aws-glue-libs:glue_libs_3.0.0_image_01
WORKDIR /home/glue_user/workspace/jupyter_workspace
ENV DISABLE_SSL=true
RUN python3 -m pip install --upgrade pip
RUN pip3 install pyathena
RUN pip3 install awswrangler
RUN pip3 install pydeequ
CMD [ "./start.sh" ]
Open up the permissions on the folder where the Dockerfile is located (this folder is mounted into the container as the workspace):
sudo chmod -R a+rwx,o-w <folder>
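As a quick check of what the symbolic mode a+rwx,o-w amounts to, here is a small sketch using Python's stat module. It only does the bit arithmetic and does not touch any files; special bits such as setuid are ignored.

```python
import stat

# a+rwx: read, write and execute for user, group and others (0o777)
mode = stat.S_IRWXU | stat.S_IRWXG | stat.S_IRWXO

# o-w: remove the write bit for others
mode &= ~stat.S_IWOTH

print(oct(mode))                            # 0o775
print(stat.filemode(stat.S_IFDIR | mode))   # drwxrwxr-x
```

So every file and folder under the target ends up readable, writable and executable by the owner and group, and readable and executable by everyone else.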
You should have a start.sh file in the same directory with the following two lines:
livy-server start
jupyter lab --no-browser --ip=0.0.0.0 --allow-root --ServerApp.root_dir=/home/glue_user/workspace/jupyter_workspace/ --ServerApp.token='pyspark' --ServerApp.password=''
In addition, your docker-compose.yaml is:
version: '3.9'
services:
  aws_glue:
    build: .
    volumes:
      - .:/home/glue_user/workspace/jupyter_workspace
    privileged: true
    ports:
      - 8888:8888
      - 4040:4040
Use Docker Compose to set up this environment.
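If you prefer to drive this from a script, the sketch below assembles the docker compose up command for the folder holding the Dockerfile and the Compose file. The helper name compose_up_command and its arguments are hypothetical; the flags used (--project-directory, --build, --detach) are standard Docker Compose v2 options.

```python
import shlex


def compose_up_command(project_dir: str = ".", rebuild: bool = True,
                       detach: bool = True) -> list:
    """Return the argument list for starting the Glue container via Compose."""
    cmd = ["docker", "compose", "--project-directory", project_dir, "up"]
    if rebuild:
        cmd.append("--build")    # rebuild the image from the Dockerfile
    if detach:
        cmd.append("--detach")   # run in the background
    return cmd


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(compose_up_command()))
```

Running the printed command from a shell has the same effect; the Python wrapper is only a convenience.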
Notebook for JDBC access
You have to specify the JDBC driver file in Jupyter before accessing the database:
%%configure -f
{"conf": {
"spark.jars": "s3://<s3-bucket>/jdbc/AthenaJDBC42.jar"
}
}
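The body of the %%configure cell is plain JSON, so it can be generated rather than typed by hand, which helps when you need more than one jar. This is a sketch under that assumption; configure_payload is a hypothetical helper, and the bucket name in the usage example is a placeholder. Spark expects spark.jars as a comma-separated list.

```python
import json


def configure_payload(jar_uris):
    """Return the JSON body for a %%configure cell that loads the given jars."""
    conf = {"conf": {"spark.jars": ",".join(jar_uris)}}
    return json.dumps(conf, indent=2)


if __name__ == "__main__":
    # Placeholder URI: replace <s3-bucket> with your own bucket.
    print("%%configure -f")
    print(configure_payload(["s3://<s3-bucket>/jdbc/AthenaJDBC42.jar"]))
```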
Create a glue context:
from awsglue.job import Job
from awsglue.context import GlueContext
# GlueContext expects a SparkContext; spark is the notebook's SparkSession
glueContext = GlueContext(spark.sparkContext)
Create a database JDBC connection:
from datetime import datetime
con = (
    glueContext.read.format("jdbc")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option("AwsCredentialsProviderClass","com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider")
    .option("url", "jdbc:awsathena://athena.ap-southeast-2.amazonaws.com:443")
    .option("S3OutputLocation","s3://{}/temp/{}".format('<s3 bucket>', datetime.now().strftime("%m%d%y%H%M%S")))
)
Create a Spark DataFrame over JDBC by executing the SQL held in the query variable:
glue_df = con.option('query', query).load()
glue_df.show(10)
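The snippets above rely on a query variable and on a timestamped S3 output location. The sketch below shows one way to define both; the helper s3_output_location, the bucket name and the table in the example query are hypothetical, and the timestamp format is the same %m%d%y%H%M%S used above.

```python
from datetime import datetime


def s3_output_location(bucket, now=None):
    """Return a per-run Athena output prefix such as s3://bucket/temp/010223030405."""
    now = now or datetime.now()
    return "s3://{}/temp/{}".format(bucket, now.strftime("%m%d%y%H%M%S"))


# Hypothetical query: replace the database and table with your own.
query = "SELECT * FROM my_database.my_table LIMIT 10"

if __name__ == "__main__":
    print(s3_output_location("my-bucket"))
    print(query)
```

A fresh prefix per run keeps the result files of different runs apart in the bucket.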
Using pyathena
Create a cursor:
from pyathena import connect
cursor = connect(s3_staging_dir="s3://<s3-bucket>/temp/",
region_name="ap-southeast-2").cursor()
Execute the query and inspect the description of the result set:
cursor.execute(query)
print(cursor.description)
Load the data into a pandas DataFrame:
import pandas as pd
from pyathena.pandas.util import as_pandas
df = as_pandas(cursor)
df
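If you do not need pandas, the DB-API description printed earlier is enough to label the rows yourself: each entry of cursor.description is a tuple whose first element is the column name. The helpers below are a hypothetical sketch of that idea and work with any DB-API cursor, not only pyathena.

```python
def column_names(description):
    """Extract the column names from a DB-API cursor.description."""
    return [col[0] for col in description]


def rows_as_dicts(description, rows):
    """Pair every row with the column names from the description."""
    names = column_names(description)
    return [dict(zip(names, row)) for row in rows]


if __name__ == "__main__":
    # Stand-in data shaped like cursor.description and cursor.fetchall()
    fake_description = [("id", "integer", None, None, 10, 0, "UNKNOWN"),
                        ("name", "varchar", None, None, 255, 0, "UNKNOWN")]
    fake_rows = [(1, "a"), (2, "b")]
    print(rows_as_dicts(fake_description, fake_rows))
```

With a live cursor, the call would be rows_as_dicts(cursor.description, cursor.fetchall()), run right after cursor.execute(query) and before anything else consumes the rows.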
Create Development env
You can create an EC2-based development environment using the following CloudFormation (CFN) template:
AWSTemplateFormatVersion: '2010-09-09'
Description: Create Dev environment
Parameters:
  UserName:
    Type: String
    Description: Please provide first name to create the environment
    Default: ojitha
  VpcId:
    Type: String
    Description: Vpc id where to launch an EC2
  SubnetId:
    Type: String
    Description: Private Subnet where to launch an EC2
  MyPublicKey:
    Type: String
    Description: Please provide your public key
Resources:
  SGForDevEC2:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Open SSH port 22 for the EC2 development environment
      GroupName: dev-security-group
      VpcId: !Ref VpcId
      SecurityGroupIngress:
        - IpProtocol: tcp
          CidrIp: <CIDR IP range>
          FromPort: 22
          ToPort: 22
        - IpProtocol: tcp
          CidrIp: <CIDR IP range>
          FromPort: 22
          ToPort: 22
      SecurityGroupEgress:
        - IpProtocol: -1
          CidrIp: 0.0.0.0/0
          FromPort: -1
          ToPort: -1
      Tags:
        - Key: Name
          Value:
            !Join
              - "-"
              - - !Sub ${UserName}
                - 'dev'
                - 'sg'
  UserPublicKey:
    Type: AWS::EC2::KeyPair
    Properties:
      KeyName:
        !Join
          - "-"
          - - !Sub ${UserName}
            - 'dev'
            - 'key'
      PublicKeyMaterial: !Ref MyPublicKey
  EC2ForDev:
    Type: AWS::EC2::Instance
    Properties:
      # ImageId: ami-07620139298af599e
      ImageId: ami-0b55fc9b052b03618
      InstanceType: t2.large
      SubnetId: !Ref SubnetId
      KeyName: !Ref UserPublicKey
      SecurityGroupIds:
        - !GetAtt SGForDevEC2.GroupId
      IamInstanceProfile: !Ref RootInstanceProfile
      BlockDeviceMappings:
        - DeviceName: /dev/xvda
          Ebs:
            VolumeSize: 50
            Encrypted: true
            VolumeType: gp2
            DeleteOnTermination: true
      UserData:
        Fn::Base64: |
          #!/bin/bash -xe
          yum update -y
          yum -y install tmux
          yum -y install @development zlib-devel bzip2 bzip2-devel readline-devel sqlite sqlite-devel openssl-devel xz xz-devel libffi-devel findutils
          yum -y install jq
          amazon-linux-extras install -y docker
          usermod -a -G docker ec2-user
          curl -LS --connect-timeout 5 \
              --max-time 10 \
              --retry 5 \
              --retry-delay 0 \
              --retry-max-time 60 \
              "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
          unzip -u awscliv2.zip
          ./aws/install
          curl -LS --connect-timeout 5 \
              --max-time 10 \
              --retry 5 \
              --retry-delay 0 \
              --retry-max-time 60 \
              "https://github.com/aws/aws-sam-cli/releases/latest/download/aws-sam-cli-linux-x86_64.zip" -o "aws-sam-cli-linux-x86_64.zip"
          unzip aws-sam-cli-linux-x86_64.zip -d sam-installation
          ./sam-installation/install
          sudo -u ec2-user -i <<'EOF'
          echo '--- Install pyenv for ec2-user ---'
          source ~/.bashrc
          RETRIES=3; DELAY=10; COUNT=1; while [ $COUNT -lt $RETRIES ]; do git clone https://github.com/pyenv/pyenv.git $HOME/.pyenv; if [ $? -eq 0 ]; then RETRIES=0; break; fi; let COUNT=$COUNT+1; sleep $DELAY; done
          echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
          echo 'command -v pyenv >/dev/null || export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc
          echo 'eval "$(pyenv init -)"' >> ~/.bashrc
          source ~/.bashrc
          RETRIES=3; DELAY=10; COUNT=1; while [ $COUNT -lt $RETRIES ]; do git clone https://github.com/pyenv/pyenv-virtualenv.git $(pyenv root)/plugins/pyenv-virtualenv; if [ $? -eq 0 ]; then RETRIES=0; break; fi; let COUNT=$COUNT+1; sleep $DELAY; done
          echo '--- Install git-remote for ec2-user ---'
          pip3 install git-remote-codecommit
          RETRIES=3; DELAY=10; COUNT=1; while [ $COUNT -lt $RETRIES ]; do pyenv install 3.9.14 ; if [ $? -eq 0 ]; then RETRIES=0; break; fi; let COUNT=$COUNT+1; sleep $DELAY; done
          pyenv virtualenv 3.9.14 p39
          echo '--- Install docker-compose for ec2-user ---'
          echo 'export DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}' >> ~/.bashrc
          echo 'export PATH="$DOCKER_CONFIG/cli-plugins:$PATH"' >> ~/.bashrc
          source ~/.bashrc
          mkdir -p $DOCKER_CONFIG/cli-plugins
          curl -LS --connect-timeout 5 \
              --max-time 10 \
              --retry 5 \
              --retry-delay 0 \
              --retry-max-time 60 \
              https://github.com/docker/compose/releases/download/v2.11.0/docker-compose-linux-x86_64 -o $DOCKER_CONFIG/cli-plugins/docker-compose
          chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
          echo '--- end ---'
          EOF
      Tags:
        - Key: Name
          Value:
            !Join
              - "-"
              - - !Sub ${UserName}
                - 'dev'
                - 'ec2'
  CPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: CPU alarm to stop Dev instance
      AlarmName:
        !Join
          - "-"
          - - !Sub ${UserName}
            - 'dev'
            - 'alarm'
            - 'stop'
      ActionsEnabled: true
      AlarmActions: [ !Sub "arn:aws:automate:${AWS::Region}:ec2:stop" ]
      MetricName: CPUUtilization
      Namespace: AWS/EC2
      Statistic: Average
      Period: '900'
      EvaluationPeriods: '3'
      Threshold: '0.3'
      ComparisonOperator: LessThanOrEqualToThreshold
      Dimensions:
        - Name: InstanceId
          Value:
            Ref: EC2ForDev
  RootRole:
    Type: "AWS::IAM::Role"
    Properties:
      RoleName:
        !Join
          - "-"
          - - !Sub ${UserName}
            - 'dev'
            - 'ROOT'
            - 'role'
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          -
            Effect: "Allow"
            Principal:
              Service:
                - "ec2.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Path: "/"
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/PowerUserAccess
  CFNPolicies:
    Type: "AWS::IAM::Policy"
    Properties:
      PolicyName: "CFNPolicy"
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Action:
              - cloudformation:*
              - lambda:*
              - sns:*
              - events:*
              - logs:*
              - ec2:*
              - s3:*
              - dynamodb:*
              - kms:*
              - iam:*
              - states:*
              - sts:*
              - sqs:*
              - elasticfilesystem:*
              - config:*
              - cloudwatch:*
              - apigateway:*
              - backup:*
              - firehose:*
              - backup-storage:*
              - ssm:*
            Resource: '*'
            Effect: Allow
      Roles:
        - Ref: "RootRole"
  VSCodePolicies:
    Type: "AWS::IAM::Policy"
    Properties:
      PolicyName: "VSCode"
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: CloudFormationTemplate
            Effect: Allow
            Action:
              - cloudformation:CreateChangeSet
            Resource:
              - arn:aws:cloudformation:*:aws:transform/Serverless-2016-10-31
          - Sid: CloudFormationStack
            Effect: Allow
            Action:
              - cloudformation:CreateChangeSet
              - cloudformation:CreateStack
              - cloudformation:DeleteStack
              - cloudformation:DescribeChangeSet
              - cloudformation:DescribeStackEvents
              - cloudformation:DescribeStacks
              - cloudformation:ExecuteChangeSet
              - cloudformation:GetTemplateSummary
              - cloudformation:ListStackResources
              - cloudformation:UpdateStack
            Resource:
              - arn:aws:cloudformation:*:111111111111:stack/*
          - Sid: S3
            Effect: Allow
            Action:
              - s3:CreateBucket
              - s3:GetObject
              - s3:PutObject
            Resource:
              - arn:aws:s3:::*/*
          - Sid: ECRRepository
            Effect: Allow
            Action:
              - ecr:BatchCheckLayerAvailability
              - ecr:BatchGetImage
              - ecr:CompleteLayerUpload
              - ecr:CreateRepository
              - ecr:DeleteRepository
              - ecr:DescribeImages
              - ecr:DescribeRepositories
              - ecr:GetDownloadUrlForLayer
              - ecr:GetRepositoryPolicy
              - ecr:InitiateLayerUpload
              - ecr:ListImages
              - ecr:PutImage
              - ecr:SetRepositoryPolicy
              - ecr:UploadLayerPart
            Resource:
              - arn:aws:ecr:*:111111111111:repository/*
          - Sid: ECRAuthToken
            Effect: Allow
            Action:
              - ecr:GetAuthorizationToken
            Resource:
              - '*'
          - Sid: Lambda
            Effect: Allow
            Action:
              - lambda:AddPermission
              - lambda:CreateFunction
              - lambda:DeleteFunction
              - lambda:GetFunction
              - lambda:GetFunctionConfiguration
              - lambda:ListTags
              - lambda:RemovePermission
              - lambda:TagResource
              - lambda:UntagResource
              - lambda:UpdateFunctionCode
              - lambda:UpdateFunctionConfiguration
            Resource:
              - arn:aws:lambda:*:111111111111:function:*
          - Sid: IAM
            Effect: Allow
            Action:
              - iam:CreateRole
              - iam:AttachRolePolicy
              - iam:DeleteRole
              - iam:DetachRolePolicy
              - iam:GetRole
              - iam:TagRole
            Resource:
              - arn:aws:iam::111111111111:role/*
          - Sid: IAMPassRole
            Effect: Allow
            Action: iam:PassRole
            Resource: '*'
            Condition:
              StringEquals:
                iam:PassedToService: lambda.amazonaws.com
          - Sid: APIGateway
            Effect: Allow
            Action:
              - apigateway:DELETE
              - apigateway:GET
              - apigateway:PATCH
              - apigateway:POST
              - apigateway:PUT
            Resource:
              - arn:aws:apigateway:*::*
      Roles:
        - Ref: "RootRole"
  AthenaPolicies:
    Type: "AWS::IAM::Policy"
    Properties:
      PolicyName: "Athena"
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Action:
              - athena:ListEngineVersions
              - athena:ListWorkGroups
              - athena:ListDataCatalogs
              - athena:ListDatabases
              - athena:GetDatabase
              - athena:ListTableMetadata
              - athena:GetTableMetadata
            Resource: '*'
            Effect: Allow
            Sid: athenaglobal
          - Action:
              - athena:GetQueryResultsStream
              - athena:GetWorkGroup
              - athena:GetQueryExecution
              - athena:CreatePreparedStatement
              - athena:GetPreparedStatement
              - athena:ListPreparedStatements
              - athena:UpdatePreparedStatement
              - athena:DeletePreparedStatement
              - athena:StartQueryExecution
              - athena:StopQueryExecution
              - athena:GetQueryResults
            Resource:
              - arn:aws:athena:ap-southeast-2:111111111111:workgroup/primary
            Effect: Allow
            Sid: athenaWorkgroup
      Roles:
        - Ref: "RootRole"
  S3Policies:
    Type: "AWS::IAM::Policy"
    Properties:
      PolicyName: "S3Policies"
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Action:
              - s3:GetObject
              - s3:ListBucket
            Resource:
              - arn:aws:s3:::111111111111-<...athena...>-prod*
              - arn:aws:s3:::111111111111-<...athena...>-prod*/*
            Effect: Allow
            Sid: S3Polices
          - Action:
              - s3:*
            Resource:
              - arn:aws:s3:::111111111111-oj-temp
              - arn:aws:s3:::111111111111-oj-temp/*
            Effect: Allow
          - Action:
              - s3:GetObject
              - s3:PutObject
              - s3:ListBucket
              - s3:DeleteObject
            Resource:
              - arn:aws:s3:::111111111111-oj-glue
              - arn:aws:s3:::111111111111-oj-glue/*
            Effect: Allow
      Roles:
        - Ref: "RootRole"
  # # use this template if you need to add access
  # RolePolicies:
  #   Type: "AWS::IAM::Policy"
  #   Properties:
  #     PolicyName: "root"
  #     PolicyDocument:
  #       Version: "2012-10-17"
  #       Statement:
  #         - Effect: "Allow"
  #           Action: "*"
  #           Resource: "*"
  #     Roles:
  #       - Ref: "RootRole"
  RootInstanceProfile:
    Type: "AWS::IAM::InstanceProfile"
    Properties:
      Path: "/"
      Roles:
        - Ref: "RootRole"
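To launch the stack from a script, one option is boto3. The sketch below is not part of the original setup: the helper names, the stack name and the file name dev-env.yaml are hypothetical, and the placeholder values in the template (account ID 111111111111, CIDR ranges, bucket names, AMI) must be replaced first. Because the template sets an explicit RoleName, CloudFormation requires the CAPABILITY_NAMED_IAM capability.

```python
from pathlib import Path


def stack_parameters(user_name, vpc_id, subnet_id, public_key):
    """Map the four template parameters to the structure create_stack expects."""
    values = {
        "UserName": user_name,
        "VpcId": vpc_id,
        "SubnetId": subnet_id,
        "MyPublicKey": public_key,
    }
    return [{"ParameterKey": k, "ParameterValue": v} for k, v in values.items()]


def create_dev_stack(template_path, stack_name, parameters):
    """Create the development stack; needs boto3 and valid AWS credentials."""
    import boto3  # imported here so the helper above works without boto3

    cfn = boto3.client("cloudformation")
    return cfn.create_stack(
        StackName=stack_name,
        TemplateBody=Path(template_path).read_text(),
        Parameters=parameters,
        # The template names its IAM role explicitly, hence NAMED_IAM
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
```

A call would look like create_dev_stack("dev-env.yaml", "ojitha-dev", stack_parameters("ojitha", "<vpc id>", "<subnet id>", "<public key>")), with the IDs and key filled in for your account.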
References