My Notes


Notes on Data Engineering



Data Lineage

mindmap
  (("Lineage"))
    ["Coarse
    Grained"]
      ["Propagate across
      Storage Tiers"]
        ["ETL"]
    ["Fine
    Grained"]
      ["Column 
      level"]

You can use the Spline agent to capture runtime lineage information from Spark jobs running on AWS Glue. For lineage tracking in workloads that combine graph data and machine learning, Amazon Web Services announced Amazon SageMaker ML Lineage Tracking at re:Invent 2021.

SageMaker ML Lineage Tracking integrates with SageMaker Pipelines and creates and stores information about the steps of automated ML workflows, from data preparation to model deployment.

Partitioning

Data skew in partitioning occurs when your data is not evenly distributed across partitions, resulting in some partitions holding significantly more records (or bytes) than others.
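A minimal sketch of how skew could be measured, assuming a simple hash partitioner. The function names and the sample keys are hypothetical and only illustrate the idea; real engines such as Spark use their own partitioners.

```python
import zlib
from collections import Counter


def partition_of(key: str, num_partitions: int) -> int:
    # Deterministic hash partitioner (crc32 is stable across runs, unlike hash()).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def skew_ratio(keys, num_partitions: int) -> float:
    # Largest partition size divided by the mean partition size.
    # 1.0 means perfectly even; larger values mean more skew.
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    sizes = [counts.get(p, 0) for p in range(num_partitions)]
    mean = sum(sizes) / num_partitions
    return max(sizes) / mean


# Hypothetical data set: one "hot" key dominates the records.
hot_keys = ["customer-1"] * 90 + [f"customer-{i}" for i in range(2, 12)]
print(skew_ratio(hot_keys, 4))
```

A ratio well above 1.0 suggests that one partition does most of the work, which is usually the point where salting the key or repartitioning is worth considering.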

AWS SNS

Amazon SNS is a managed pub/sub messaging service for reliable real-time event notifications to multiple recipients. Subscribers can be different AWS services.

πŸ“ Amazon SNS does not guarantee message ordering. For strict message ordering, Amazon SQS FIFO queues would be more appropriate.

SNS can deliver:

  1. Application to a person (email, SMS or mobile push)

  2. Application to Application

Topic

A topic decouples message publishers from subscribers. Publishers send messages to a topic, and all subscribers to that topic receive the same messages. There are two types of topics: standard and FIFO.

SNS supported protocols

πŸ“ Standard topics use at-least-once delivery and best-effort ordering. FIFO topics use no duplication and first-in-first-out ordered delivery.

SNS supports a fan-out pattern because multiple subscribers can subscribe to a topic simultaneously.
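The fan-out idea can be sketched with an in-memory model. This is not the SNS API; the `Topic` class and the subscriber lists are hypothetical and only show that every subscriber gets its own copy of each published message.

```python
class Topic:
    """In-memory stand-in for a pub/sub topic (illustration only)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, deliver):
        # `deliver` is any callable that accepts one message.
        self.subscribers.append(deliver)

    def publish(self, message):
        # Fan-out: the same message is delivered to every subscriber.
        for deliver in self.subscribers:
            deliver(message)


orders = Topic()
email_inbox, queue_inbox = [], []
orders.subscribe(email_inbox.append)
orders.subscribe(queue_inbox.append)
orders.publish("order-created")
```

The publisher never needs to know how many subscribers exist, which is the decoupling a topic provides.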

SNS supports message filtering, so a subscriber can receive only the messages it is interested in. AWS Lambda can integrate directly with SNS and trigger automated actions based on published messages.
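A minimal sketch of a Lambda handler subscribed to an SNS topic. It relies on the SNS event shape in which each record carries the notification under `Records[].Sns`; the return format and the `raw` fallback key are my own assumptions.

```python
import json


def lambda_handler(event, context):
    processed = []
    for record in event.get("Records", []):
        sns = record["Sns"]
        body = sns["Message"]
        try:
            # Publishers often send JSON, but SNS messages are plain strings.
            payload = json.loads(body)
        except json.JSONDecodeError:
            payload = {"raw": body}
        processed.append({"message_id": sns.get("MessageId"), "payload": payload})
    return {"processed": processed}
```

The handler can be tried locally by passing a hand-built event dictionary and `None` as the context.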

AWS SQS

Amazon SQS acts as an intermediary to

AWS Services Integrations:

Queue

  1. Standard queues: For applications that can process messages that arrive more than once and out of order. Standard queues use
    1. At-Least-Once delivery and
    2. Best-Effort ordering
  2. FIFO queues: Designed to enhance messaging between applications when the order of operations and events is critical, or where duplicates can’t be tolerated. FIFO queues use
    1. Exactly-Once processing and
    2. First-In-First-Out delivery

Message

Text or binary data, up to 256 KB in size.

A message group ID is a tag that specifies that a message belongs to a specific message group. Messages of the same message group are always processed in sequence, in a strict order relative to the message group.

πŸ“ Messages that belong to different message groups might be processed out of order.

Message deduplication ID is a token used to prevent duplication in FIFO queues only.
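How a deduplication ID behaves can be sketched with a small in-memory model. This is not how SQS is implemented; the class name is hypothetical, and the 300-second default mirrors the 5-minute deduplication interval of FIFO queues.

```python
class FifoDeduplicator:
    """Rejects a deduplication ID that was already accepted within the window."""

    def __init__(self, window_seconds=300):
        self.window_seconds = window_seconds
        self._accepted_at = {}  # deduplication ID -> time it was accepted

    def accept(self, dedup_id, now):
        first_seen = self._accepted_at.get(dedup_id)
        if first_seen is not None and now - first_seen < self.window_seconds:
            return False  # duplicate inside the window: dropped
        self._accepted_at[dedup_id] = now
        return True


dedup = FifoDeduplicator()
first = dedup.accept("order-42", now=0)      # accepted
retry = dedup.accept("order-42", now=10)     # duplicate, dropped
later = dedup.accept("order-42", now=400)    # window expired, accepted again
```

Passing the clock in as `now` keeps the sketch deterministic; a real producer would simply resend with the same deduplication ID when it is unsure whether a send succeeded.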

DLQ

A dead-letter queue (DLQ) is a separate queue.

The DLQ holds messages that a consumer could not process successfully after a specified number of attempts.
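A sketch of the queue attribute that attaches a DLQ to a source queue. `RedrivePolicy`, `deadLetterTargetArn` and `maxReceiveCount` are the SQS attribute names; the helper name, the default of 5 attempts and the example ARN are assumptions for illustration.

```python
import json


def redrive_attributes(dlq_arn: str, max_receive_count: int = 5) -> dict:
    # After `max_receive_count` unsuccessful receives, SQS moves the
    # message to the queue identified by `dlq_arn`.
    policy = {
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    }
    # SQS expects the policy as a JSON string inside the attributes map.
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical ARN, for illustration only.
attributes = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq", 3)
```

With boto3, the result would be passed as `Attributes=attributes` to `set_queue_attributes` on the source queue, together with its `QueueUrl`.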

Visibility timeout

The visibility timeout is the amount of time a message stays invisible to other consumers after one consumer receives it.

πŸ“ This technique prevents duplicate processing of messages.

Short polling

The consumer queries the queue for available messages and gets an immediate response, even if no messages are found.

Long polling

The consumer can wait for messages to arrive in an SQS queue, up to a configurable timeout, rather than continually polling the queue for new messages.

πŸ“ This method can reduce the number of empty responses and subsequent requests made.

Batching

SQS provides batch actions to help you reduce costs and manipulate up to 10 messages with a single action. These batch actions include SendMessageBatch, DeleteMessageBatch, and ChangeMessageVisibilityBatch.
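A sketch of splitting a list of message bodies into batches of at most 10 entries. The `Id` and `MessageBody` keys follow the entry format of the batch send action; the helper name and the ID scheme are assumptions.

```python
def to_batches(bodies, batch_size=10):
    # Yield lists of entries, each list small enough for one batch call.
    for start in range(0, len(bodies), batch_size):
        chunk = bodies[start:start + batch_size]
        yield [
            # Ids only need to be unique within a single batch request.
            {"Id": str(start + offset), "MessageBody": body}
            for offset, body in enumerate(chunk)
        ]


batches = list(to_batches([f"event-{i}" for i in range(25)]))
```

With boto3, each batch would be sent as `Entries=batch` in a `send_message_batch` call, so 25 messages take 3 requests instead of 25.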

Delay Queues

Delay queues postpone the delivery of new messages to consumers for a number of seconds, which is useful when a consumer needs additional time to process messages.

πŸ“ Any messages that you send to the queue remain invisible to the consumers for the duration of the delay period set.

AWS Lambda

Say your Lambda function is lambda_function.py:

zip lambda_function.zip lambda_function.py

Using the AWS CLI:

aws lambda update-function-code  --function-name <function name> --zip-file fileb://lambda_function.zip

You can run the commands above in a SageMaker AI terminal, for example.
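The packaging step can also be sketched in Python with the standard library. The helper name and default file names are assumptions that mirror the commands above.

```python
import zipfile
from pathlib import Path


def build_lambda_zip(source="lambda_function.py", target="lambda_function.zip") -> bytes:
    # Store the file at the archive root, as `zip` does for a file in the
    # current directory, so Lambda can find the handler module.
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(source, arcname=Path(source).name)
    # Return the raw bytes so they can be uploaded directly.
    return Path(target).read_bytes()
```

With boto3, the returned bytes could be passed as `ZipFile=` to `update_function_code` on a Lambda client, together with `FunctionName`.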

To capture the name of an S3 bucket in a variable using grep:

WEB_BUCKET=$(aws s3 ls | grep www- | awk '{ print $3 }')
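The same lookup can be sketched in Python over the text that `aws s3 ls` prints, where each line holds a date, a time and the bucket name. The function name and sample listing are hypothetical. Unlike the shell pipeline, which keeps every match, this sketch returns only the first one.

```python
def find_bucket(listing: str, pattern: str = "www-"):
    for line in listing.splitlines():
        fields = line.split()
        # Third field is the bucket name, like awk's $3.
        if len(fields) >= 3 and pattern in fields[2]:
            return fields[2]
    return None


# Hypothetical output of `aws s3 ls`, for illustration only.
sample_listing = """2024-01-10 09:15:02 data-lake-raw
2024-02-03 17:40:11 www-example-site
"""
web_bucket = find_bucket(sample_listing)
```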
