This blog explains how to create an AWS Glue container to develop PySpark scripts locally. I’ve already explained how to run Glue locally in Glue Development using Jupyter.

For this blog, you need three files in the same directory: the Dockerfile, the docker-compose.yml file, and a start.sh script.

Dockerfile

Here is the Dockerfile:

FROM amazon/aws-glue-libs:glue_libs_3.0.0_image_01
WORKDIR /home/glue_user/workspace/jupyter_workspace
# Serve Jupyter over plain HTTP instead of the image's self-signed SSL
ENV DISABLE_SSL=true
# Install any extra Python libraries here
RUN pip3 install pyathena
CMD [ "./start.sh" ]

You can install any additional Python libraries in the RUN pip3 install step; PyAthena is installed here as an example.
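For instance, to add a second library you would extend that same step, e.g. RUN pip3 install pyathena awswrangler (awswrangler is just an illustrative choice, not something this setup requires).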

You also need a start.sh script, which starts the Livy server and Jupyter Lab:

#!/bin/bash
livy-server start
jupyter lab --no-browser --ip=0.0.0.0 --allow-root --ServerApp.root_dir=/home/glue_user/workspace/jupyter_workspace/ --ServerApp.token='pyspark' --ServerApp.password=''
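Because the container runs this script directly via CMD, it must be executable; run chmod +x start.sh on the host before starting the container (the file reaches the container through the volume mount defined below).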

Docker compose file

Here is the docker-compose.yml:

version: '3.9'
services:
  aws_glue:
    build: .
    volumes:
      - .:/home/glue_user/workspace/jupyter_workspace
    privileged: true
    ports:
      - 8888:8888
      - 4040:4040
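The volume mount shares your current directory (including start.sh) with the container's workspace, port 8888 exposes Jupyter Lab, and port 4040 exposes the Spark UI.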

Run

To build and start the container:

docker compose up
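If you change the Dockerfile later (for example, to add more libraries), rebuild the image with docker compose up --build.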

If you are using VS Code, first you have to specify the Jupyter server connection:

Jupyter Server Connection

and specify the remote server URL (http://localhost:8888/?token=pyspark):

Jupyter remote server URL

In the Command Palette, select Create: New Jupyter Notebook.

Connect to remote docker glue server

In the notebook, select the Glue Spark - Local (PySpark) (Remote) Jupyter kernel. Because the Dockerfile above installs PyAthena, you should be able to import pyathena as a test.
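As a quick sanity check in the new notebook, you can confirm both the Spark setup and the PyAthena install. This is a minimal sketch; the DataFrame contents are just placeholder values:

import pyathena  # verifies the library installed in the Dockerfile is importable

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create a GlueContext on top of the local Spark context
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Build and show a tiny DataFrame to confirm Spark is working
df = spark.createDataFrame([("glue", 1), ("local", 2)], ["name", "value"])
df.show()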

You can access the Spark UI at http://localhost:4040.