My Notes
Productivity, DevOps, Email, Kubernetes, Programming, Python, MongoDB, macOS, REST, RDBMS, PowerShell, SCM, Unix Tools

Notes on Productivity tools
Blog tools
I was very enthusiastic to learn about markdown-level diagramming.
Mermaid
One of the best I have found so far is Mermaid, which I have used with my blogging tool stackedit.io. It renders flowcharts and other diagrams directly from markdown text, which makes diagramming very easy.
asciinema
The tool asciinema records your terminal session and uploads it to the cloud. On macOS, you can install it with brew:
brew install asciinema
XML
Tools for XML
Python
To set up a complete Python environment, see my Python workflow.
To find the directories on the PYTHONPATH:
import sys
from pprint import pprint
pprint(sys.path)
Atom editor for Spark
First set the following path:
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip
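As a quick sanity check, the entries you set in PYTHONPATH should show up on sys.path. A small stdlib-only sketch (PYTHONPATH is split on os.pathsep and prepended when the interpreter starts):

```python
import os
import sys

def pythonpath_entries():
    """Split PYTHONPATH into its individual entries (directories or zip files)."""
    raw = os.environ.get("PYTHONPATH", "")
    return [p for p in raw.split(os.pathsep) if p]

# Each entry set before the interpreter started should appear on sys.path.
for entry in pythonpath_entries():
    print(entry, entry in sys.path)
```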
I have used SDKMAN to install Spark.
Now create a virtual environment:
pyenv global 2.7.18
virtualenv mypython
source mypython/bin/activate
python -m pip install --upgrade pip
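After activating, you can confirm from Python itself that you are inside the virtual environment. A minimal stdlib-only check (sys.prefix differs from the base prefix inside a venv/virtualenv):

```python
import sys

def in_venv():
    """True when the interpreter is running inside a venv/virtualenv."""
    base = getattr(sys, "base_prefix", None) or getattr(sys, "real_prefix", sys.prefix)
    return sys.prefix != base

print(in_venv())
```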
In the virtual environment, install:
pip install ipykernel
If the above is not working, run the following:
python -m ipykernel install --user --name=env
You can open the notebook in the Atom editor and do inline debugging if you install the Hydrogen package in the editor.
If you want to use PySpark, first install
pip install pyspark
To find the installed pyspark version:
pip show pyspark
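pip show reads the installed package metadata; the same information is available from Python via importlib.metadata (stdlib since Python 3.8). A minimal sketch, which returns None when the package is absent:

```python
from importlib import metadata

def pkg_version(name):
    """Return the installed version string of a package, or None if not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None

print(pkg_version("pyspark"))
```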
If you want, install the following packages in the Atom editor:
- Script (to execute Python from the IDE, CMD+i)
- autocomplete-python
- flake8 (to enable it, pip install flake8)
- python-autopep8
Diff
Here is the way to semantically diff XML files. First create your project in a Python virtual environment:
python3 -m venv xmltest
cd xmltest
source bin/activate
Your project is xmltest. Now install the graphtage package:
pip install graphtage
Now you are ready to compare the m1.xml and p1.xml files:
graphtage p1.xml m1.xml
This prints the semantic diff to the CLI. Run deactivate in the CLI to move out of the project environment.
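Why a semantic diff rather than a plain text diff: XML documents that differ textually can still be equivalent, for example because attribute order is not significant. A small stdlib illustration of this (not graphtage itself, just the motivation):

```python
import xml.etree.ElementTree as ET

def same_xml_attrs(x, y):
    """Compare two single-element XML snippets, ignoring attribute order."""
    return ET.fromstring(x).attrib == ET.fromstring(y).attrib

# Textually different, semantically identical: attribute order does not matter in XML.
print(same_xml_attrs('<item id="1" type="book"/>', '<item type="book" id="1"/>'))
```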
VSCode extensions for Python
Some of the extensions tested for Python:
- flake8 and blue: flake8 reports on code styling, among many other issues, and blue rewrites source code according to (most) rules embedded in the black code formatting tool.
- isort: organise imports
- JSON Path Status Bar: Show JSON path of the element
- Output Colorizer: VSCode output in color
- Open Folder Context Menu for VS Code: This will open a new instance of VSCode for the selected folder in the Explorer.
- Pylint: Lint from Microsoft
Spark
I have configured Spark using SDKMAN.
docker run --name pyspark -e JUPYTER_ENABLE_LAB=yes -e JUPYTER_TOKEN="pyspark" -v "$(pwd)":/home/jovyan/work -p 8888:8888 jupyter/pyspark-notebook:d4cbf2f80a2a
Use http://localhost:8888/?token=pyspark to open the Jupyter notebook.
To run the Zeppelin:
docker run -u $(id -u) -p 8080:8080 -p 4040:4040 --rm -v $PWD/logs:/logs -v $PWD/:/notebook -e ZEPPELIN_LOG_DIR='/logs' -e ZEPPELIN_NOTEBOOK_DIR='/notebook' --name zeppelin apache/zeppelin:0.10.0
Command to run Apache Airflow:
docker run -ti -p 8080:8080 -v ${PWD}/<dag>.py:/opt/airflow/dags/download_rocket_launches.py --name airflow --entrypoint=/bin/bash apache/airflow:2.0.0-python3.8 -c '( airflow db init && airflow users create --username admin --password admin --firstname Anonymous --lastname Admin --role Admin --email ojithak@gmail.com); airflow webserver & airflow scheduler'
Docker database containers
Postgres
Create the Docker container (in the current directory, create a data folder first):
docker run -t -i \
--name Mastering-postgres \
--rm \
-p 5432:5432 \
-e POSTGRES_PASSWORD=ojitha \
-v "$(pwd)/data":/var/lib/postgresql/data \
postgres:13.4
To access psql, get a shell inside the container:
docker exec -it Mastering-postgres bash
Inside the shell, run the following command to get into psql:
psql -h localhost -p 5432 -U postgres
MSSQL
Pull the image
docker pull mcr.microsoft.com/mssql/server:2019-latest
To run:
docker run -e "ACCEPT_EULA=Y" -e "MSSQL_SA_PASSWORD=Pwd@2023" `
-p 1433:1433 --name sql1 --hostname sql1 `
-v C:\Users\ojitha\dev\mssql\data:/var/opt/mssql/data `
-v C:\Users\ojitha\dev\mssql\log:/var/opt/mssql/log `
-d `
mcr.microsoft.com/mssql/server:2019-latest
Download the sample database backup. Run the following fix before restoring:
docker container exec sql1 touch /var/opt/mssql/data/AdventureWorks2019.mdf
docker container exec sql1 touch /var/opt/mssql/log/AdventureWorks2019_log.ldf
Jekyll
To start Jekyll:
bundle exec jekyll serve
Quarto
Quarto is based on Pandoc. Here is the workflow to include a Jupyter notebook in a Jekyll site.
- First create the Jupyter notebook in VSCode and include the YAML front matter in raw form:

  ---
  title: PySpark Date Example
  format:
    html:
      code-fold: true
  jupyter: python3
  ---

- Now copy the ipynb to a temp directory.
- Now run the following command:

  quarto render pyspark_date_example.ipynb --to html

- Copy both the generated folder and the HTML file to the <jekyll root>/_include folder.
- Remove the <!DOCTYPE html> first statement from the HTML page.
- And add the post, such as:

  ---
  layout: post
  title: PySpark Date Example
  date: 2022-03-02
  categories: [Apache Spark]
  ---
  PySpark date in string to date type conversion example. How you can use Python SQL functions like `datediff` to calculate the difference in days.
  <!--more-->
  -- include pyspark_date_example.html using liquid --

  Embed the HTML file into the post using a Liquid include, as in the last line above.
- Now run Jekyll if it is not already started.
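The example post mentions Spark SQL's `datediff`. For reference, the same day-difference calculation in plain Python (no PySpark required) looks like this; the ISO date strings are made-up samples:

```python
from datetime import date

def datediff(end, start):
    """Days between two ISO date strings, in the spirit of Spark SQL's datediff(end, start)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days

print(datediff("2022-03-02", "2022-02-28"))  # 2
```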
CSVKit
Create a Postgres Docker container first (the default port is 5432). To import a CSV file into the postgres test database:
csvsql --db postgresql://postgres:ojitha@localhost/test --insert data.csv
NOTE: Table will be created as test.public.data.
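To try the import end to end, you can generate a small data.csv first. A stdlib-only sketch; the invoice_no and amount columns are hypothetical sample data, not from the original file:

```python
import csv

rows = [
    {"invoice_no": "INV-001", "amount": "120.50"},
    {"invoice_no": "INV-002", "amount": "75.00"},
]

def write_sample(path="data.csv"):
    """Write a tiny CSV that csvsql could --insert into Postgres."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["invoice_no", "amount"])
        writer.writeheader()
        writer.writerows(rows)
    return path

def read_back(path):
    """Read the CSV back as a list of dicts, to sanity-check the file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

write_sample()
print(read_back("data.csv")[0]["invoice_no"])  # INV-001
```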
You can query the table:
SELECT * FROM test.public.data
WHERE "invoice_no" IN (....)
ORDER BY "invoice_no";