Azure Synapse Analytics cost optimizations

#sparkpool, #deltaLake, #adls-gen2, #pipeline, #azure


image generated using Bing Image Creator

Recently I undertook a cost-optimization exercise on an Azure Synapse Analytics platform that was incurring a running cost of $200K per month (notice the K at the end).

A glance at the resources was enough to spot the main culprit: over-provisioning.

In this blog post, I am documenting my experience and listing all the culprits.

#Culprit 1 - Spark pool XXL nodes

Over-provisioning of the Spark pool cluster may feel unavoidable, but it can be optimized.

When writing a notebook and allocating a new Spark pool for its execution, follow a bottom-up approach to sizing: start with the smallest node size and executor count that works, and scale up only when the workload demands it. The available node sizes are listed below.

| Size | vCores | Memory |
| --- | --- | --- |
| Small | 4 | 32 GB |
| Medium | 8 | 64 GB |
| Large | 16 | 128 GB |
| XLarge | 32 | 256 GB |
| XXLarge | 64 | 432 GB |
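
Before stepping up a size, it helps to check what the current session has actually been allocated. Here is a minimal PySpark sketch; it assumes it runs in a Synapse notebook where a Spark session is already available, and it only reads standard Spark configuration keys:

```python
from pyspark.sql import SparkSession

# Reuse the session provided by the Synapse notebook (or create one locally).
spark = SparkSession.builder.getOrCreate()

# Standard Spark properties that show how much the session has allocated.
keys_of_interest = (
    "spark.executor.memory",
    "spark.executor.cores",
    "spark.executor.instances",
    "spark.dynamicAllocation.enabled",
)

for key, value in spark.sparkContext.getConf().getAll():
    if key in keys_of_interest:
        print(f"{key} = {value}")
```

If the executors sit mostly idle at these settings, step down a node size before reaching for XL or XXL.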

#Culprit 2 - Event-triggered Copy Data pipelines

You might wonder why an event-triggered Copy Data pipeline is a cost issue. Each pipeline run uses the integration runtime allocated to it and is charged based on Data Integration Units (DIUs), so the per-run charges add up quickly.

If 50,000 files need to be copied, firing 50,000 separate pipeline runs will cost considerably more than batching the copy activity behind a schedule trigger, as the rough comparison below illustrates.
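
To make the difference concrete, here is a back-of-the-envelope sketch. The DIU-hour rate, per-run overhead, and durations are hypothetical placeholders, not actual Azure pricing:

```python
# Hypothetical numbers purely for illustration -- check the Azure pricing page
# for the real DIU-hour rate and orchestration charges in your region.
FILES = 50_000
DIU_HOUR_RATE = 0.25           # assumed $/DIU-hour (placeholder)
ORCHESTRATION_PER_RUN = 0.001  # assumed $/pipeline run (placeholder)

# Event-triggered: one pipeline run per file, each billed for a small slice
# of DIU time (assume 4 DIUs for roughly 1 minute per run).
per_file_cost = 4 * (1 / 60) * DIU_HOUR_RATE + ORCHESTRATION_PER_RUN
event_triggered_total = FILES * per_file_cost

# Schedule-triggered: one batched run copying all files with a wildcard path,
# say 16 DIUs for 2 hours (assumed).
batched_total = 16 * 2 * DIU_HOUR_RATE + ORCHESTRATION_PER_RUN

print(f"Event-triggered (per-file runs): ~${event_triggered_total:,.2f}")
print(f"Scheduled batch run:             ~${batched_total:,.2f}")
```

Even with generous assumptions, the per-run overhead of tens of thousands of small copies dwarfs the cost of a single batched run.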

#Culprit 3 - ADLS redundancy settings

When configuring your applications and provisioning resources, pay extra attention to the data redundancy settings you choose.

| Parameter | LRS | ZRS | GRS / RA-GRS | GZRS / RA-GZRS |
| --- | --- | --- | --- | --- |
| Percent durability of objects over a given year | At least 99.999999999% (11 9's) | At least 99.9999999999% (12 9's) | At least 99.99999999999999% (16 9's) | At least 99.99999999999999% (16 9's) |
| Availability for read requests | At least 99.9% (99% for cool or archive access tiers) | At least 99.9% (99% for cool access tier) | At least 99.9% (99% for cool or archive access tiers) for GRS; at least 99.99% (99.9% for cool or archive access tiers) for RA-GRS | At least 99.9% (99% for cool access tier) for GZRS; at least 99.99% (99.9% for cool access tier) for RA-GZRS |

Table source: Azure Storage redundancy documentation.

Even the minimum setting (LRS) guarantees plenty of durability and availability for most workloads. Again, the decision is purely driven by the redundancy requirements of the application, so choose what fits best.
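
If you provision storage programmatically, the redundancy option is just one parameter on the account. Here is a minimal sketch using the azure-identity and azure-mgmt-storage packages; the subscription ID, resource group, account name, and region are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

# Hypothetical values -- replace with your own subscription and naming.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-analytics"
ACCOUNT_NAME = "adlscostdemo01"

client = StorageManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Standard_LRS is the cheapest redundancy option; swap in Standard_ZRS,
# Standard_GRS, Standard_RAGRS, etc. only if your requirements demand it.
poller = client.storage_accounts.begin_create(
    RESOURCE_GROUP,
    ACCOUNT_NAME,
    StorageAccountCreateParameters(
        location="eastus",
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),
        is_hns_enabled=True,  # hierarchical namespace -> ADLS Gen2
    ),
)
print(poller.result().provisioning_state)
```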

#Culprit 4 - Delta Lake Fragmented Files

Delta Lake is an excellent framework for almost any ETL or data-warehousing requirement, but special attention is needed when writing many small files to Delta Lake tables.

Ingesting many small files leaves the table fragmented, which degrades read performance when querying the Delta Lake tables.

This is where file compaction comes to the rescue (available in Delta Lake 1.2.0 and above). A simple OPTIMIZE command compacts the fragmented files into one or more larger files, reducing read latency.
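
A minimal PySpark sketch of running compaction from a Synapse notebook; the ADLS path is a hypothetical placeholder, and it assumes the pool runs Delta Lake 1.2.0 or later:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path -- point this at your own Delta table on ADLS Gen2.
table_path = "abfss://curated@<storage-account>.dfs.core.windows.net/sales"

# Compact the small files into larger ones.
spark.sql(f"OPTIMIZE delta.`{table_path}`")

# Optionally remove the now-unreferenced small files once they fall outside
# the retention window (7 days by default).
spark.sql(f"VACUUM delta.`{table_path}`")
```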


I was able to bring down the cost of the platform by 50-60% just by tackling the inefficiencies in the code and platform configuration listed above.

Share your thoughts if you have tackled cost-optimization problems of your own!


This is my first ever article on any platform, and I would love to hear your feedback about it :)