Posts
Spark - create database and table
This is a short note on creating a Hive metastore using Spark 3.3.1.
Semantic search with ELSER in Elasticsearch
Elastic Learned Sparse EncodeR (ELSER) is a retrieval model trained by Elastic that enables you to perform semantic search to retrieve more relevant search results.
- Install ELSER v2: only once (DevOps will do this for you)
- Create a source index where you can insert all your documents
- Create target index
- Create ingestion pipeline
- Reindex process to create embeddings
- Ready to do semantic search using text expansion queries
I created this blog post on Docker to demonstrate the Linux-optimised ELSER v2. The Elasticsearch version is 8.11.1.
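The steps listed above map onto a handful of Elasticsearch request bodies. The sketch below writes them as plain Python dictionaries and sends nothing to a cluster. The index names (`source-docs`, `target-docs`), the pipeline name (`elser-v2-pipeline`) and the field names (`content`, `content_embedding`) are assumptions made for illustration; the model id is the Linux-optimised ELSER v2 identifier as documented for Elasticsearch 8.11.

```python
import json

# Assumed names for illustration only; adjust them to your own cluster.
MODEL_ID = ".elser_model_2_linux-x86_64"  # Linux-optimised ELSER v2
SOURCE_INDEX = "source-docs"
TARGET_INDEX = "target-docs"
PIPELINE = "elser-v2-pipeline"

# Body for PUT /target-docs: the text plus a field for its sparse embedding.
target_mapping = {
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "content_embedding": {"type": "sparse_vector"},
        }
    }
}

# Body for PUT /_ingest/pipeline/elser-v2-pipeline: run ELSER on each document.
ingest_pipeline = {
    "processors": [
        {
            "inference": {
                "model_id": MODEL_ID,
                "input_output": [
                    {"input_field": "content", "output_field": "content_embedding"}
                ],
            }
        }
    ]
}

# Body for POST /_reindex: copy documents through the pipeline to create embeddings.
reindex_body = {
    "source": {"index": SOURCE_INDEX},
    "dest": {"index": TARGET_INDEX, "pipeline": PIPELINE},
}


def text_expansion_query(text):
    """Build the body for GET /target-docs/_search (semantic search)."""
    return {
        "query": {
            "text_expansion": {
                "content_embedding": {"model_id": MODEL_ID, "model_text": text}
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(text_expansion_query("how to tune Kafka consumers"), indent=2))
```

Each dictionary can be passed as the JSON body of the request named in its comment, for example from Kibana Dev Tools or any HTTP client.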
Kafka PySpark streaming example
The diagram shows that the Kafka producer reads from Wikimedia and writes to the Kafka topic. Then the Kafka Spark consumer pulls the data from the Kafka topic and writes the stream batches to disk.
Terraform For each iteration
This post explains the Terraform for each looping technique. In this example, 3 buckets are created to demonstrate the looping idea.
In the first step, we will create the above 3 buckets starting from 0.
Spark to create a table in AWS Redshift
In this post, Spark reads the data from a CSV file into a DataFrame and saves that DataFrame as a Redshift table.
In addition to that, I’ve explained how to create a table in Postgres, use Jupyter magics and plot a diagram.
Spark Kafka Docker Configuration
This is the continuation of the Spark Streaming Basics. There I explained the basic stream example, which runs only on one AWS Glue container. The stream producer was Netcat, and the sink was a text file. In this post, the stream producer is still Netcat, but the sink is Kafka. Both Kafka and Spark are running on Docker containers.
Spark Streaming Basics
This is a very basic example created to explain Spark streaming. Spark runs locally on the AWS Glue container.
Introduction to Lambda Calculus
This is a short description of lambda calculus. Lambda calculus is the smallest programming language that is capable of variable substitution and a single function definition scheme. Haskell is a functional programming language based on lambda calculus, which I will explore. I already explained how to use VSCode for Haskell Development to support the code listed here.
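Variable substitution and single-argument function definition can be imitated with Python lambdas. The following is a minimal sketch using Church encodings; the names `TRUE`, `FALSE`, `ZERO`, `succ`, `add` and `to_int` are illustrative and not taken from the post, which works in Haskell.

```python
# Church booleans: a boolean selects one of two arguments.
TRUE = lambda x: lambda y: x
FALSE = lambda x: lambda y: y

# Church numerals: the numeral n applies f to x exactly n times.
ZERO = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))


def to_int(n):
    """Convert a Church numeral to a Python int for inspection."""
    return n(lambda k: k + 1)(0)


ONE = succ(ZERO)
TWO = succ(ONE)
THREE = add(ONE)(TWO)

print(to_int(THREE))    # 3
print(TRUE("a")("b"))   # a
print(FALSE("a")("b"))  # b
```

Every function here takes exactly one argument, and applying a function is the only operation used, which mirrors the substitution step of lambda calculus.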
Python Parameter passing
Discusses the most common ways of passing parameters to Python functions.
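As a rough illustration of the parameter kinds Python supports, the sketch below puts them into a single signature. The function `describe` and its arguments are hypothetical, and the positional-only marker `/` assumes Python 3.8 or later.

```python
def describe(name, /, greeting="Hello", *args, punctuation="!", **kwargs):
    """name: positional-only; greeting: positional or keyword;
    *args: extra positionals; punctuation: keyword-only;
    **kwargs: extra keyword arguments."""
    extras = ", ".join(str(a) for a in args)
    tags = ", ".join(f"{k}={v}" for k, v in sorted(kwargs.items()))
    return f"{greeting} {name}{punctuation} [{extras}] [{tags}]"


print(describe("Ada"))
# Hello Ada! [] []
print(describe("Ada", "Hi", 1, 2, punctuation="?", lang="en"))
# Hi Ada? [1, 2] [lang=en]

# Arguments can also be unpacked from a sequence and a mapping.
positional = ("Ada", "Hey")
keywords = {"punctuation": "."}
print(describe(*positional, **keywords))
# Hey Ada. [] []
```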
Python Data Classes
Python data classes using collections.namedtuple, typing.NamedTuple and the latest @dataclass decorator.
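The three approaches can be compared side by side. This is a minimal sketch with a hypothetical two-field point type; the class names `PointA`, `PointB` and `PointC` are made up for the comparison.

```python
from collections import namedtuple
from dataclasses import dataclass
from typing import NamedTuple

# 1. collections.namedtuple: a lightweight immutable tuple with field names.
PointA = namedtuple("PointA", ["x", "y"])


# 2. typing.NamedTuple: the same idea with type hints and defaults.
class PointB(NamedTuple):
    x: int
    y: int = 0


# 3. @dataclass: a regular class with generated __init__, __repr__ and __eq__.
@dataclass
class PointC:
    x: int
    y: int = 0


a = PointA(1, 2)
b = PointB(1)
c = PointC(1)
c.y = 5  # dataclass instances are mutable by default

print(a, b, c)
```

Both named tuple variants are immutable and compare equal to plain tuples, while a dataclass instance can be modified unless it is declared with `frozen=True`.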
Scala - S3 bucket operations
How to list and upload S3 bucket contents using Scala.
Scala - AWS EMR Serverless
AWS EMR Serverless is a cost-effective AWS service to which you can submit Spark Scala jobs.
AWS CI/CD pipeline to Copy files to S3 bucket
Sometimes it is necessary to copy files to AWS S3 via CI/CD build pipelines.
Notes on Introduction to Advanced Bash Usage
While going through the following YouTube talk and its associated presentation, I recorded my hands-on notes here. It is recommended to go through the basics first. You can also refer to the Bash Ref Manual for more information.
Pandas type conversion
Sometimes we need to remove unnecessary data and save a column in the right format in a Pandas data frame.
AWS Glue run locally
This blog explains how to create an AWS Glue container to develop PySpark scripts locally. I’ve already explained how to run the Glue locally using Glue Development using Jupyter.
Access AWS SSM via AWS Stepfunctions
Configuration will be available throughout the pipeline if it can be stored in the AWS Stepfunctions. Generally, configuration should be stored in the SSM parameter store. How can the SSM parameter store be accessed from an AWS Stepfunction?
Glue Development using Jupyter
Developing and testing the Glue job in the VS Code IDE is one of the best development options because Jupyter doesn’t support IDE features. In this blog, I set up a Glue Docker instance on EC2 and use the VS Code Jupyter notebook feature to develop Glue jobs. If you want to create your own, more customized Docker image, please see AWS Glue run locally.
AWS CFN - Create IGW and NAT
In this post, let’s see how to create an Internet Gateway (IGW) and a NAT Gateway using CloudFormation (CFN).
This post is a continuation of the AWS CFN - Create VPC and subnets.
AWS CFN - Create VPC and subnets
This is a fundamental example of creating an AWS VPC and its subnets using AWS CloudFormation (CFN). The next post discusses AWS CFN - Create IGW and NAT.
Spark to consume Kafka Stream
A simple PySpark example to show how to consume a Kafka stream (from the given Kafka tutorial).
Kubernetes API
Let’s see how to play with K8s on macOS using Minikube. Some of the topics are very basic, such as how to create a namespace and a pod in it, shell into the pod, and then delete the pod and the namespace. However, this is written to address concepts such as ConfigMap, secrets, resource sharing and Helm charts.
RegEx on MacOS
As I understand it, RegExs are very useful for general work. Most of the following regular expressions (RegExs) can be run in the macOS terminal, where you get great value from command-line tools that have little value without RegExs (grep, sed and so on). In addition, I’ve used some popular tools to explain complex operations later in the document, which have been referenced under the footnotes.
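To give a feel for the two most common uses, filtering lines (as grep does) and rewriting them (as sed does), here is a small sketch with Python's `re` module rather than the terminal tools the post covers. The log lines are made-up sample data.

```python
import re

# Hypothetical sample input; in the terminal this would be a file or a pipe.
lines = [
    "2023-01-15 ERROR disk full",
    "2023-01-16 INFO backup done",
    "2023-02-01 ERROR network down",
]

# grep-like: keep only the lines that match a pattern.
errors = [line for line in lines if re.search(r"\bERROR\b", line)]

# sed-like: rewrite YYYY-MM-DD as DD/MM/YYYY using capture groups.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
swapped = [date_pattern.sub(r"\3/\2/\1", line) for line in lines]

print(errors)
print(swapped[0])  # 15/01/2023 ERROR disk full
```

The same patterns carry over to `grep -E` and `sed -E` with small syntax differences, for example `[0-9]` in place of `\d`.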
PySpark Date Example
An example of converting a date held as a string to the date type in PySpark, and of using PySpark SQL functions such as datediff to calculate the difference in days.
Python Sequences
Basic operations on the Python list and tuple are discussed here.
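A few of the basic operations, as a minimal sketch with made-up values: lists are mutable, tuples are not, and both support indexing, slicing and unpacking.

```python
# Lists are mutable sequences.
numbers = [3, 1, 2]
numbers.append(4)        # [3, 1, 2, 4]
numbers.sort()           # [1, 2, 3, 4]
first_two = numbers[:2]  # slicing returns a new list: [1, 2]

# Tuples are immutable sequences.
point = (10, 20)
x, y = point             # unpacking
combined = point + (30,) # concatenation builds a new tuple: (10, 20, 30)

try:
    point[0] = 99        # item assignment is not allowed on a tuple
except TypeError:
    tuple_is_immutable = True

print(numbers, first_two, combined, tuple_is_immutable)
```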
PySpark Data Frame to Pie Chart
I am sharing a Jupyter notebook.
Jenkins in Docker Container
This is the source code to create a Jenkins Docker container.
Java Annotations
Annotations are metadata that provide information at the retention level of Java source, class or runtime.
Understand JPMS
The Java Platform Module System (JPMS) was introduced in Java 9. With Java 9, the JDK has been divided into 90 modules. This is a simple example created using IntelliJ IDEA.
As shown in the above diagram, there are three modules, Application, Service and Provider.
Java Thread interrupt
It is important to understand how the Java thread interrupt works.
Source | Target | Action |
---|---|---|
New | Runnable | Thread.start() is called. |
Runnable | Blocked | The thread waits to acquire a synchronized lock. |
Runnable | Waiting | Object.wait() is called. |
Runnable | Timed-waiting | Thread.sleep(...) is called. |
Runnable | Terminated | The thread finishes. |
Use of default and static methods
A default method is added to maintain backward compatibility, which allows older classes (without modifications) to work with a new version of an interface.
Java 9 interfaces can have private methods and private static methods. These methods support code reusability at the interface level.
Normalization
E. F. Codd proposed three normal forms, 1NF, 2NF and 3NF (1970 to 1971). A revised definition (1974) was given by R. F. Boyce and Codd, which is known as Boyce-Codd Normal Form (BCNF, sometimes called 3.5NF) to distinguish it from the old definition of third normal form. R. Fagin introduced 4NF (1977), 5NF (1979) and DKNF (1981). The normal forms up to BCNF depend on functional dependency, whereas 4NF and 5NF are based on the concepts of multivalued dependency and join dependency, respectively.
MapReduce
Hadoop MapReduce shows well the pain of writing too much code for a simple MapReduce job in Java. This blog explains how to use the MRJob package in Python to write and run a MapReduce job on movie ratings.
HDFS Basics
After installing the sandbox from Hortonworks, you can visit the http://localhost:50070 page to find information about the HDFS cluster. The YARN job manager can be accessed via http://localhost:8088.
Java Future
Java Futures are the way to support asynchronous operations. Learn the basics of Java 9 Parallelism before reading this post.
Java Concurrent CompletableFuture
The CompletableFuture was introduced in JDK 8 (2014). This is an abstraction over the java.util.concurrent. Learn the basics of Java 9 Parallelism before reading this post.
Spring boot property and profile management
Spring property and profile management is explained.
Spring boot CLI
This is a short explanation of how to use the Spring Boot CLI to create a project and run it on macOS.
GitHub API
The GitHub API is hypermedia-based. This is an elementary post introducing how to interact with the GitHub API using the curl and jq tools.
Quick sort in Python
The best-case and average running time of quick sort is \(O(n\log{}n)\).
To learn more about Python generators, see python fun.
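For reference, a minimal quick sort sketch in Python. This is not necessarily the version used in the post: it is a simple, non-in-place variant that picks the middle element as the pivot and builds new lists.

```python
def quick_sort(items):
    """Return a new sorted list using quick sort (not in place)."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]
    smaller = [x for x in items if x < pivot]
    equal = [x for x in items if x == pivot]
    larger = [x for x in items if x > pivot]
    # Sort each side recursively and join the three partitions.
    return quick_sort(smaller) + equal + quick_sort(larger)


print(quick_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```

When the partitions are roughly balanced, the recursion depth is about \(\log n\) and each level does \(O(n)\) work, which gives the \(O(n\log{}n)\) average; consistently unbalanced partitions degrade it to \(O(n^2)\).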
Selection sort in Python
The running time of selection sort is high, at \(O(N^2)\).
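A minimal selection sort sketch, assuming a copy of the input is sorted and returned; the two nested loops are where the quadratic cost comes from.

```python
def selection_sort(items):
    """Return a new sorted list using selection sort."""
    result = list(items)
    n = len(result)
    for i in range(n - 1):
        # Find the index of the smallest element in result[i:].
        smallest = i
        for j in range(i + 1, n):
            if result[j] < result[smallest]:
                smallest = j
        # Move it to position i.
        result[i], result[smallest] = result[smallest], result[i]
    return result


print(selection_sort([64, 25, 12, 22, 11]))  # [11, 12, 22, 25, 64]
```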
Binary search in Python
Binary search is one of the most fundamental algorithms.
I explain the procedural and functional ways of writing the binary search algorithm.
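The two styles can be sketched as follows, assuming a sorted list and a return value of -1 when the target is missing. The function names are illustrative, and the functional version here is simply a recursive formulation without mutable loop state.

```python
def binary_search_iter(items, target):
    """Procedural version: narrow the [low, high] window in a loop."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1


def binary_search_rec(items, target, low=0, high=None):
    """Recursive version: each call searches one half of the window."""
    if high is None:
        high = len(items) - 1
    if low > high:
        return -1
    mid = (low + high) // 2
    if items[mid] == target:
        return mid
    if items[mid] < target:
        return binary_search_rec(items, target, mid + 1, high)
    return binary_search_rec(items, target, low, mid - 1)


data = [1, 3, 5, 7, 9, 11]
print(binary_search_iter(data, 7))  # 3
print(binary_search_rec(data, 4))   # -1
```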
Python run on containers
We have already explained Website hosted as a container. This post explains how to host a Flask web application.
Apache Spark begins with PySpark
PySpark is one of the most popular ways of using Spark. This blog covers the basics of Spark SQL with data frames.
Website hosted as a container
This is a very short tutorial to show how to quickly create a web server using a Docker container. Docker should be installed on your machine as a prerequisite.
First step to AWS CDK
This is my first step in using AWS CDK on macOS. I am using the Pyenv tool to create the Python environment as explained in the Python my workflow.
Here is a simple example created using AWS CDK.
I followed the AWS CDK Python workshop.
Bash Introduction
Understand Bash scripting for day-to-day use as a developer.
Rusty terminal tools
This is a very interesting blog that explains new terminal tools written in Rust 🚀.
Minikube Introduction
Minikube is a local Kubernetes cluster for education and prototyping purposes only; it is not for production use because of security, performance and stability issues.
Learn the basics here.
Mac keyboard shortcut to copy file path as markdown
How to create a macOS Automator Quick Action to copy the markdown path of a file or folder using shortcut keys, in the same way you copy the path name of the file or folder.
Elastic Search NER
This post is on ElasticSearch 8 NER.
Elastic Search Introduction
This post is on ElasticSearch 8 and the Elastic Stack.