Posts

Spark - create database and table

This is a short note on creating a Hive metastore using Spark 3.3.1.

Semantic search with ELSER in Elasticsearch

Elastic Learned Sparse EncodeR (ELSER) is a retrieval model trained by Elastic that enables you to perform semantic search and retrieve more relevant search results.

Summary of ELSER process

  1. Install ELSER v2: only once (DevOps will do this for you)
  2. Create source index where you can insert all your documents
  3. Create target index
  4. Create ingestion pipeline
  5. Reindex process to create embeddings
  6. Ready to do semantic search using text expansion queries

I created this blog post using Docker to demonstrate the Linux-optimised ELSER v2. The Elasticsearch version is 8.11.1.
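
The steps above can be sketched as the JSON request bodies they would need, built here as plain Python dictionaries. This is only a sketch under assumptions: the index names, pipeline id and field names are hypothetical, the model id is my understanding of the Linux-optimised ELSER v2 identifier, and the body shapes should be checked against the Elasticsearch 8.11 documentation before use.

```python
# Hypothetical names, used only for illustration.
SOURCE_INDEX = "docs-source"
TARGET_INDEX = "docs-elser"
PIPELINE_ID = "elser-v2-pipeline"
# Assumed id of the Linux-optimised ELSER v2 model.
MODEL_ID = ".elser_model_2_linux-x86_64"


def target_mapping(text_field="content", token_field="content_embedding"):
    """Step 3: target index with a sparse_vector field for the ELSER tokens."""
    return {
        "mappings": {
            "properties": {
                token_field: {"type": "sparse_vector"},
                text_field: {"type": "text"},
            }
        }
    }


def ingest_pipeline(text_field="content", token_field="content_embedding"):
    """Step 4: pipeline with an inference processor that runs ELSER."""
    return {
        "processors": [
            {
                "inference": {
                    "model_id": MODEL_ID,
                    "input_output": [
                        {"input_field": text_field, "output_field": token_field}
                    ],
                }
            }
        ]
    }


def reindex_body():
    """Step 5: copy documents from source to target through the pipeline."""
    return {
        "source": {"index": SOURCE_INDEX},
        "dest": {"index": TARGET_INDEX, "pipeline": PIPELINE_ID},
    }


def semantic_query(text, token_field="content_embedding"):
    """Step 6: text expansion query against the sparse vector field."""
    return {
        "query": {
            "text_expansion": {
                token_field: {"model_id": MODEL_ID, "model_text": text}
            }
        }
    }
```

Each dictionary would be sent as the body of the corresponding REST call (create index, put pipeline, reindex, search).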

Kafka PySpark streaming example

The diagram shows that the Kafka producer reads from Wikimedia and writes to the Kafka topic. Then the Kafka Spark consumer pulls the data from the Kafka topic and writes the stream batches to disk.

Architecture of the streaming application

Terraform For each iteration

This post explains the Terraform for each looping technique. In this example, 3 buckets are created to demonstrate the looping idea.

create 3 S3 buckets

In the first step, we will create the above 3 buckets starting from 0.

Spark to create a table in AWS Redshift

In this post, Spark reads the data from a CSV file into a DataFrame and saves that DataFrame as a Redshift table.

Spark to Redshift

In addition to that, I’ve explained how to create a table in Postgres, use Jupyter magics and plot a diagram.

Spark Kafka Docker Configuration

This is the continuation of the Spark Streaming Basics. I explained the basic stream example, which runs only on one AWS Glue container. The stream producer was Netcat, and the sink was a text file. In this post, the stream producer is still Netcat, but the sink is Kafka. Both Kafka and Spark run in Docker containers.

Simple Streaming with Spark and Kafka

Spark Streaming Basics

This is a very basic example created to explain Spark streaming. Spark runs locally on the AWS Glue container.

Introduction to Lambda Calculus

This is a short description of lambda calculus. Lambda calculus is the smallest programming language that is capable of variable substitution and a single function definition scheme. Haskell is a functional programming language based on lambda calculus, which I will explore. I already explained how to use VSCode for Haskell Development to support the code listed here.
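
As a minimal sketch of how far substitution and a single function form can go, the following encodes Church numerals. It uses Python lambdas rather than the Haskell used in the post, and the names `zero`, `succ`, `add` and `to_int` are chosen here for illustration.

```python
# Church numerals: the number n is a function that applies f to x n times.
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))


def to_int(n):
    """Convert a Church numeral to a Python int by counting applications."""
    return n(lambda k: k + 1)(0)


one = succ(zero)
two = succ(one)
three = add(one)(two)
print(to_int(three))  # 3
```

Everything here is built from single-argument functions and application only; `to_int` exists purely to inspect the result.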

Python Parameter passing

Discusses most of the possible ways of passing parameters to Python functions.
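
A short sketch of the main parameter kinds in one signature, assuming Python 3.8 or later (needed for the positional-only marker `/`). The function `describe` and its parameter names are hypothetical.

```python
def describe(pos_only, /, standard, default="d", *args, kw_only, **kwargs):
    """Return every received argument so each passing style can be inspected."""
    return {
        "pos_only": pos_only,   # must be passed by position
        "standard": standard,   # by position or by keyword
        "default": default,     # optional, falls back to "d"
        "args": args,           # extra positional arguments
        "kw_only": kw_only,     # must be passed by keyword
        "kwargs": kwargs,       # extra keyword arguments
    }


minimal = describe(1, 2, kw_only=3)
by_keyword = describe(1, standard=2, kw_only=3, extra=4)
with_extras = describe(1, 2, 3, 4, 5, kw_only=6)


def append_item(items):
    """Arguments are passed by object reference: this mutates the caller's list."""
    items.append("x")


shared = []
append_item(shared)
```

After these calls, `with_extras["args"]` holds the surplus positional values and `shared` contains the appended item, which shows that mutable arguments are not copied.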

Python Data Classes

Python data classes using collections.namedtuple, typing.NamedTuple and the newer @dataclass decorator.
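
The three approaches can be compared side by side in a minimal sketch. The `PointA`, `PointB` and `PointC` names are hypothetical and chosen only for this comparison.

```python
from collections import namedtuple
from dataclasses import dataclass
from typing import NamedTuple

# 1. collections.namedtuple: an immutable tuple with named fields, no type hints.
PointA = namedtuple("PointA", ["x", "y"])


# 2. typing.NamedTuple: the same idea with type hints and default values.
class PointB(NamedTuple):
    x: float
    y: float = 0.0


# 3. @dataclass: a regular class with generated __init__, __repr__ and __eq__.
@dataclass
class PointC:
    x: float
    y: float = 0.0


a = PointA(1.0, 2.0)
b = PointB(1.0)
c = PointC(1.0)
c.y = 5.0  # data classes are mutable unless declared with frozen=True
```

The two tuple-based forms stay immutable and unpack like tuples, while the data class behaves like an ordinary mutable object.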

Scala - S3 bucket operations

How to list and upload S3 bucket contents using Scala.

Scala - AWS EMR Serverless

AWS EMR Serverless is a cost-effective AWS service to which you can submit Spark Scala jobs.

AWS CI/CD pipeline to Copy files to S3 bucket

Sometimes it is necessary to copy files to AWS S3 via CI/CD build pipelines.

Notes on Introduction to Advanced Bash Usage

While going through the following YouTube talk and its associated presentation, I recorded my hands-on notes here. It is recommended to go through the basics first. You can also refer to the Bash Ref Manual for more information.

Pandas type conversion

Sometimes we need to remove unnecessary data and store columns in the right format in Pandas data frames.

AWS Glue run locally

This blog explains how to create an AWS Glue container to develop PySpark scripts locally. I’ve already explained how to run Glue locally in Glue Development using Jupyter.

Access AWS SSM via AWS Stepfunctions

Configuration will be available throughout the pipeline if it can be stored in AWS Step Functions. Generally, configuration should be stored in the SSM Parameter Store. How do you access the SSM Parameter Store from AWS Step Functions?

Glue Development using Jupyter

Developing and testing the Glue job in the VS Code IDE is one of the best development options because Jupyter doesn’t support IDE features. In this blog, I set up a Glue Docker instance on EC2 and use the VS Code Jupyter notebook feature to develop Glue jobs. If you want to create your own, more customized Docker image, please see AWS Glue run locally.

AWS CFN - Create IGW and NAT

In this post, let’s see how to create an Internet Gateway (IGW) and a NAT Gateway using CloudFormation (CFN).

Fig.1: Architecture Diagram

This post is a continuation of the AWS CFN - Create VPC and subnets.

AWS CFN - Create VPC and subnets

This is a fundamental example of creating an AWS VPC and its subnets using AWS CloudFormation (CFN). In the next post, I discuss AWS CFN - Create IGW and NAT.

Fig.1 VPC architecture

Spark to consume Kafka Stream

A simple PySpark example to show how to consume a Kafka stream (given Kafka tutorial).

Kubernetes API

Let’s see how to play with K8s on macOS using Minikube. Some of the topics are very basic, such as how to create a namespace and a pod in it, shell into the pod, and then delete the pod and the namespace. However, this is written to address concepts such as ConfigMap, secrets, resource sharing and Helm charts.

RegEx on MacOS

As I understand it, regular expressions (RegExs) are very useful for general work. Most of the following RegExs can be run in the macOS terminal, where they unlock the value of command-line tools such as grep and sed, which have little value without them. In addition, I’ve used some popular tools to explain complex operations later in the document, which are referenced in the footnotes.

PySpark Date Example

An example of converting a date held as a string to a date type in PySpark, and of using PySpark SQL functions like datediff to calculate the difference in days.

Python Sequences

Here, basic operations on Python lists and tuples are discussed.

PySpark Data Frame to Pie Chart

I am sharing a Jupyter notebook.

Jenkins in Docker Container

This is the source code to create a Jenkins Docker container.

Java Annotations

Annotations are metadata that provide information at the retention level of Java source, class or runtime.

Understand JPMS

The Java Platform Module System (JPMS) was introduced in Java 9. With Java 9, the JDK was divided into 90 modules. This is a simple example created using IntelliJ IDEA.

module-info

As shown in the above diagram, there are three modules, Application, Service and Provider.

download source

Java Thread interrupt

It is important to understand how the Java thread interrupt works.

| Source | Target | Action |
| --- | --- | --- |
| New | Runnable | thread start(). |
| Runnable | Blocked | synchronized lock on. |
| Runnable | Waiting | when object call Object.wait(). |
| Runnable | Timed-waiting | when Thread.sleep(...). |
| Runnable | Terminated | when thread finished. |

Use of default and static methods

Default methods were added to maintain backward compatibility: they allow older classes to work with a new version of an interface without modification.

Since Java 9, interfaces can have private methods and private static methods. These methods support code reuse at the interface level.

Java Nested Classes

Classes can be defined inside other classes to encapsulate logic and constrain the context of use. For example:

TestOrder

Normalization

E. F. Codd proposed three normal forms: 1NF, 2NF and 3NF (1970). A revised definition (1974) was given by R. F. Boyce and Codd, which is known as Boyce-Codd Normal Form (BCNF, sometimes called 3.5NF) to distinguish it from the old definition of third normal form. R. Fagin introduced 4NF (1977), 5NF (1979) and DKNF (1981). The normal forms up to BCNF are based on functional dependency, whereas 4NF and 5NF are based on the concepts of multivalued dependency and join dependency, respectively.

MapReduce

Hadoop MapReduce illustrates the pain of writing too much code for a simple MapReduce job in Java. This blog explains how to use the MRJob package in Python to write and execute a movie ratings job.
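
To make the map, shuffle and reduce phases concrete, here is a minimal pure-Python sketch that counts how often each rating value occurs. It does not use MRJob or Hadoop; the sample records and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample records: (user_id, movie_id, rating).
RATINGS = [
    ("u1", "m1", 5),
    ("u1", "m2", 3),
    ("u2", "m1", 5),
    ("u3", "m3", 4),
]


def mapper(record):
    """Map phase: emit a (rating, 1) pair for each input record."""
    _user, _movie, rating = record
    yield rating, 1


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce phase: sum the counts collected for one rating value."""
    return key, sum(values)


def run(records):
    """Run the three phases in sequence on an in-memory list of records."""
    pairs = (pair for record in records for pair in mapper(record))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


counts = run(RATINGS)
```

In a real framework the shuffle step is handled for you, and only the mapper and reducer need to be written.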

HDFS Basics

After installing the Hortonworks sandbox, you can visit http://localhost:50070 to find information about the HDFS cluster. The YARN job manager can be accessed via http://localhost:8088.

Java Future

Java Futures are the way to support asynchronous operations. Learn the basics of Java 9 Parallelism before reading this post.

Java Concurrent CompletableFuture

CompletableFuture was introduced in JDK 8 (2014). This is an abstraction over `java.util.concurrent`. Learn the basics of Java 9 Parallelism before reading this post.

Spring boot property and profile management

Spring property and profile management is explained.

Spring boot CLI

This is a short explanation of how to use the Spring Boot CLI to create and run a project on macOS.

GitHub API.

The GitHub API is hypermedia-based. This is an elementary post introducing how to interact with the GitHub API using the curl and jq tools.

Quick sort in Python

The best and average running time of quick sort is \(O(n\log{}n)\).

To learn more about Python generators, see python fun.
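
As a minimal sketch of the algorithm, the following version partitions around the first element and recurses on both sides. It is an illustration only and not necessarily the implementation used in the post; it returns a new list instead of sorting in place.

```python
def quick_sort(items):
    """Return a new sorted list using the first element as the pivot."""
    if len(items) <= 1:
        return list(items)
    pivot, *rest = items
    smaller = [x for x in rest if x < pivot]    # elements left of the pivot
    larger = [x for x in rest if x >= pivot]    # elements right of the pivot
    return quick_sort(smaller) + [pivot] + quick_sort(larger)


print(quick_sort([3, 6, 1, 8, 2, 9, 2]))  # [1, 2, 2, 3, 6, 8, 9]
```

Choosing the first element as the pivot keeps the code short, but an already sorted input then leads to the worst case of \(O(n^2)\).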

Selection sort in Python

The running time of selection sort is high, at \(O(n^2)\).

Selection Sort
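
The quadratic cost comes from the nested scan, as this minimal sketch shows. It is an illustrative version that sorts a copy of the input and is not necessarily the code from the post.

```python
def selection_sort(items):
    """Return a sorted copy by repeatedly selecting the smallest remaining item."""
    result = list(items)
    n = len(result)
    for i in range(n - 1):
        smallest = i
        # Inner scan over the unsorted tail: this is the source of O(n^2).
        for j in range(i + 1, n):
            if result[j] < result[smallest]:
                smallest = j
        result[i], result[smallest] = result[smallest], result[i]
    return result


print(selection_sort([64, 25, 12, 22, 11]))  # [11, 12, 22, 25, 64]
```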

Binary search in Python

Binary Search is one of the most fundamental algorithms.

Binary Search

I explain the procedural and functional ways of implementing the binary search algorithm.
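
The two styles can be sketched as follows. The loop-based version stands for the procedural way and the recursive version, which avoids reassigning variables, stands in for the functional way; both are illustrations with hypothetical names, and both assume the input list is already sorted.

```python
def binary_search_iter(items, target):
    """Procedural style: narrow the [low, high] window in a loop."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1  # not found


def binary_search_rec(items, target, low=0, high=None):
    """Recursive style: each call searches a smaller window."""
    if high is None:
        high = len(items) - 1
    if low > high:
        return -1  # not found
    mid = (low + high) // 2
    if items[mid] == target:
        return mid
    if items[mid] < target:
        return binary_search_rec(items, target, mid + 1, high)
    return binary_search_rec(items, target, low, mid - 1)


data = [1, 3, 5, 7, 9, 11]
print(binary_search_iter(data, 7), binary_search_rec(data, 7))  # 3 3
```

Both versions return the index of the target, or -1 when it is absent.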

Python run on containers

We have already explained Website hosted as a container. This post explains how to host a Flask web application.

Apache Spark begins with PySpark

PySpark is one of the most popular ways of using Spark. This blog covers the basics of using Spark SQL with data frames.

Website hosted as a container

This is a very short tutorial showing how to quickly create a web server using a Docker container. As a prerequisite, Docker should be installed on your machine.

First step to AWS CDK

This is my first step in using AWS CDK on macOS. I am using the Pyenv tool to create a Python environment, as explained in Python my workflow.

Here is the simple example created using AWS CDK.

helloaws

I followed the AWS CDK Python workshop.

Bash Introduction

Understand Bash scripting for use in a developer's day-to-day life.

Rusty terminal tools

This is a very interesting blog explaining new terminal tools written in Rust 🚀.

Minikube Introduction

Minikube is a local Kubernetes cluster intended for education and prototyping only, not for production use, because of security, performance and stability issues.

Minikube

Learn the basics here.

Mac keyboard shortcut to copy file path as markdown

How to create a macOS Automator Quick Action to copy a file or folder path as Markdown using shortcut keys, in the same way you copy the path name of the file or folder.

Elastic Search NER

This post is on Elasticsearch 8 NER.

Elastic Search Introduction

This post is on Elasticsearch 8 and the Elastic Stack.