Azure Synapse Analytics cost optimizations

#sparkpool, #deltaLake, #adls-gen2, #pipeline, #azure


image generated using Bing Image Creator

Recently I undertook a cost-optimization exercise on an Azure Synapse Analytics platform that was incurring a running cost of $200K per month (notice the K at the end).

A glance at the resources was enough to spot the main culprit: over-provisioning.

In this blog post, I am documenting my experience and listing all the culprits.

#Culprit 1 - Spark pool XXL nodes

Over-provisioning of the Spark pool cluster may feel unavoidable, but it can be optimized.

When writing a notebook and allocating a new Spark pool for its execution, follow a bottom-up approach to sizing: start with the smallest node size and executor count that works, and scale up only when the workload demands it. The available node sizes are listed below.

| Size | vCores | Memory |
| --- | --- | --- |
| Small | 4 | 32 GB |
| Medium | 8 | 64 GB |
| Large | 16 | 128 GB |
| XLarge | 32 | 256 GB |
| XXLarge | 64 | 432 GB |
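
Before stepping up a size, it helps to check what the current session has actually been allocated. Here is a minimal PySpark sketch; it assumes it runs in a Synapse notebook where a Spark session is already available, and it only reads standard Spark configuration keys:

```python
from pyspark.sql import SparkSession

# Reuse the session provided by the Synapse notebook (or create one locally).
spark = SparkSession.builder.getOrCreate()

# Standard Spark properties that show how much the session has allocated.
keys_of_interest = (
    "spark.executor.memory",
    "spark.executor.cores",
    "spark.executor.instances",
    "spark.dynamicAllocation.enabled",
)

for key, value in spark.sparkContext.getConf().getAll():
    if key in keys_of_interest:
        print(f"{key} = {value}")
```

If the executors sit mostly idle at these settings, step down a node size before reaching for XL or XXL.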

#Culprit 2 - Event-triggered Copy Data pipelines

You might wonder why an event-triggered Copy Data pipeline is a cost issue. Each pipeline run uses the integration runtime allocated to it and is charged based on Data Integration Units (DIUs), so the per-run charges add up quickly.

If 50,000 files need to be copied, firing 50,000 separate pipeline runs will cost considerably more than batching the copy activity behind a schedule trigger, as the rough comparison below illustrates.
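
To make the difference concrete, here is a back-of-the-envelope sketch. The DIU-hour rate, per-run overhead, and durations are hypothetical placeholders, not actual Azure pricing:

```python
# Hypothetical numbers purely for illustration -- check the Azure pricing page
# for the real DIU-hour rate and orchestration charges in your region.
FILES = 50_000
DIU_HOUR_RATE = 0.25           # assumed $/DIU-hour (placeholder)
ORCHESTRATION_PER_RUN = 0.001  # assumed $/pipeline run (placeholder)

# Event-triggered: one pipeline run per file, each billed for a small slice
# of DIU time (assume 4 DIUs for roughly 1 minute per run).
per_file_cost = 4 * (1 / 60) * DIU_HOUR_RATE + ORCHESTRATION_PER_RUN
event_triggered_total = FILES * per_file_cost

# Schedule-triggered: one batched run copying all files with a wildcard path,
# say 16 DIUs for 2 hours (assumed).
batched_total = 16 * 2 * DIU_HOUR_RATE + ORCHESTRATION_PER_RUN

print(f"Event-triggered (per-file runs): ~${event_triggered_total:,.2f}")
print(f"Scheduled batch run:             ~${batched_total:,.2f}")
```

Even with generous assumptions, the per-run overhead of tens of thousands of small copies dwarfs the cost of a single batched run.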

#Culprit 3 - ADLS redundancy settings

When configuring your applications and provisioning resources, pay extra attention to the data redundancy settings you choose.

| Parameter | LRS | ZRS | GRS / RA-GRS | GZRS / RA-GZRS |
| --- | --- | --- | --- | --- |
| Percent durability of objects over a given year | At least 99.999999999% (11 9's) | At least 99.9999999999% (12 9's) | At least 99.99999999999999% (16 9's) | At least 99.99999999999999% (16 9's) |
| Availability for read requests | At least 99.9% (99% for cool or archive access tiers) | At least 99.9% (99% for cool access tier) | At least 99.9% (99% for cool or archive access tiers) for GRS; at least 99.99% (99.9% for cool or archive access tiers) for RA-GRS | At least 99.9% (99% for cool access tier) for GZRS; at least 99.99% (99.9% for cool access tier) for RA-GZRS |

Table source: Azure Storage redundancy documentation.

Even the minimum setting (LRS) guarantees plenty of durability and availability for most workloads. Again, the decision is purely driven by the redundancy requirements of the application, so choose what fits best.
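
If you provision storage programmatically, the redundancy option is just one parameter on the account. Here is a minimal sketch using the azure-identity and azure-mgmt-storage packages; the subscription ID, resource group, account name, and region are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

# Hypothetical values -- replace with your own subscription and naming.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "rg-analytics"
ACCOUNT_NAME = "adlscostdemo01"

client = StorageManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Standard_LRS is the cheapest redundancy option; swap in Standard_ZRS,
# Standard_GRS, Standard_RAGRS, etc. only if your requirements demand it.
poller = client.storage_accounts.begin_create(
    RESOURCE_GROUP,
    ACCOUNT_NAME,
    StorageAccountCreateParameters(
        location="eastus",
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),
        is_hns_enabled=True,  # hierarchical namespace -> ADLS Gen2
    ),
)
print(poller.result().provisioning_state)
```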

#Culprit 4 - Delta Lake Fragmented Files

Delta Lake is an excellent framework for almost any ETL or data-warehousing requirement, but special attention is needed when writing many small files to Delta Lake tables.

Ingesting many small files leaves the table fragmented, which degrades read performance when querying the Delta Lake tables.

This is where file compaction comes to the rescue (available in Delta Lake 1.2.0 and above). A simple OPTIMIZE command compacts the fragmented files into one or more larger files, reducing read latency.
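
A minimal PySpark sketch of running compaction from a Synapse notebook; the ADLS path is a hypothetical placeholder, and it assumes the pool runs Delta Lake 1.2.0 or later:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical path -- point this at your own Delta table on ADLS Gen2.
table_path = "abfss://curated@<storage-account>.dfs.core.windows.net/sales"

# Compact the small files into larger ones.
spark.sql(f"OPTIMIZE delta.`{table_path}`")

# Optionally remove the now-unreferenced small files once they fall outside
# the retention window (7 days by default).
spark.sql(f"VACUUM delta.`{table_path}`")
```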


I was able to bring down the cost of the platform by 50-60% just by tackling the inefficiencies in the code and platform configuration listed above.

Share your thoughts if you have tackled cost-optimization problems of your own!


This is my first ever article on any platform, and I would love to hear your feedback about it :)