Azure Synapse Analytics cost optimizations
#sparkpool, #deltaLake, #adls-gen2, #pipeline, #azure
*Image generated using Bing Image Creator*
Recently I took on a cost-optimization task for an Azure Synapse Analytics platform that was incurring a running cost of $200K per month (notice the K at the end).
One glance at the provisioned resources and the main culprit was obvious: over-provisioning.
In this blog post, I am documenting my experience and listing all the culprits I found.
#Culprit 1 - Sparkpool XXL nodes
Indeed, some over-provisioning of the Spark pool cluster can't be avoided entirely, but it can be kept in check.
When writing a notebook and allocating a new Spark pool for its execution, follow a bottom-up approach to finalizing the node and executor size: start with the smallest size that works and scale up only when the workload demands it. The available node sizes are listed below, followed by a provisioning sketch.
| Size | vCores | Memory |
| --- | --- | --- |
| Small | 4 | 32 GB |
| Medium | 8 | 64 GB |
| Large | 16 | 128 GB |
| XLarge | 32 | 256 GB |
| XXLarge | 64 | 432 GB |
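As an illustration of the bottom-up approach, here is a minimal sketch using the azure-mgmt-synapse Python SDK that provisions a Small pool with auto-scale and auto-pause enabled. The resource group, workspace, and pool names are placeholders, and exact model fields can vary slightly across SDK versions; this is not the exact setup from the project described above.

```python
# Minimal sketch (placeholders throughout): provision a small, auto-pausing
# Spark pool and grow it only when the workload proves it is needed.
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient
from azure.mgmt.synapse.models import (
    BigDataPoolResourceInfo, AutoScaleProperties, AutoPauseProperties,
)

client = SynapseManagementClient(DefaultAzureCredential(), "<subscription-id>")

pool = BigDataPoolResourceInfo(
    location="westeurope",
    spark_version="3.3",
    node_size="Small",                 # start at the bottom of the size table
    node_size_family="MemoryOptimized",
    auto_scale=AutoScaleProperties(enabled=True, min_node_count=3, max_node_count=10),
    auto_pause=AutoPauseProperties(enabled=True, delay_in_minutes=15),  # stop paying when idle
)

client.big_data_pools.begin_create_or_update(
    "rg-analytics", "syn-workspace", "smallpool", pool
).result()
```

Auto-pause in particular is an easy win: a pool that sits idle between notebook runs stops billing after the configured delay.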
#Culprit 2 - Event-triggered copy data pipelines
You might wonder why an event-triggered copy data pipeline is an issue when it comes to cost. Each pipeline run uses the Integration Runtime allocated to it, and the copy activity is charged based on Data Integration Units (DIUs) for its duration; every activity run also carries its own orchestration overhead.
If 50,000 files need to be copied, triggering 50,000 individual pipeline runs will cost far more than batching the files into a single copy activity on a schedule trigger (for example, with a wildcard or folder path as the source); an illustrative calculation follows.
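As a purely illustrative calculation (the numbers are assumptions, not actual pricing): if each copy run is billed for at least one minute of copy duration at 4 DIUs plus a small per-run orchestration charge, then 50,000 single-file runs accrue roughly 50,000 × (1/60) h × 4 DIU ≈ 3,333 DIU-hours plus 50,000 orchestration charges. One scheduled run that copies all 50,000 files with a wildcard source might finish in an hour at 16 DIUs, i.e. about 16 DIU-hours and a single orchestration charge.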
#Culprit 3 - ADLS redundancy settings
When configuring your applications and provisioning resources, pay extra attention to the data redundancy setting you actually need; the geo-redundant tiers (GRS, GZRS and their read-access variants) cost noticeably more per GB than LRS or ZRS.
| Parameter | LRS | ZRS | GRS/RA-GRS | GZRS/RA-GZRS |
| --- | --- | --- | --- | --- |
| Percent durability of objects over a given year | At least 99.999999999% (11 9's) | At least 99.9999999999% (12 9's) | At least 99.99999999999999% (16 9's) | At least 99.99999999999999% (16 9's) |
| Availability for read requests | At least 99.9% (99% for cool or archive access tiers) | At least 99.9% (99% for cool access tier) | At least 99.9% (99% for cool or archive access tiers) for GRS | At least 99.99% (99.9% for cool or archive access tiers) for RA-GRS |
Even the minimum setting (LRS) guarantees substantial durability and availability. Again, the decision is driven purely by the application's data redundancy requirements, so choose what fits and avoid paying for redundancy you don't need. A provisioning sketch follows.
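As a hedged illustration, here is a minimal sketch using the azure-mgmt-storage Python SDK to provision an ADLS Gen2 account with locally redundant storage; the resource group and account names are placeholders, and a higher-redundancy SKU would be the swap-in only if your requirements demand it.

```python
# Minimal sketch (placeholders throughout): create an ADLS Gen2 account with
# locally redundant storage (Standard_LRS), the cheapest redundancy option.
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.storage_accounts.begin_create(
    "rg-analytics",
    "datalakelrs001",
    StorageAccountCreateParameters(
        location="westeurope",
        kind="StorageV2",
        sku=Sku(name="Standard_LRS"),  # swap for Standard_ZRS / Standard_GRS only if required
        is_hns_enabled=True,           # hierarchical namespace = ADLS Gen2
    ),
).result()
```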
#Culprit 4 - Delta Lake Fragmented Files
Delta Lake is an excellent framework for almost any ETL or data warehousing requirement, but special attention is needed when writes produce many small files in Delta Lake tables.
Ingesting many small files leaves the table fragmented, and reads against the Delta Lake table degrade because each query has to list and open far more files than necessary.
This is where file compaction comes to the rescue (available in Delta Lake 1.2.0 and above). A simple OPTIMIZE command compacts the fragmented files into one or more larger files to reduce read latency, as in the sketch below.
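A minimal PySpark sketch of compaction, assuming a Synapse notebook where `spark` is the pre-created SparkSession; the abfss table path is a placeholder, and the Python builder API in the second snippet assumes a recent delta-spark release.

```python
# Minimal sketch (table path is a placeholder): compact small files in a Delta table.
from delta.tables import DeltaTable

# SQL form, available since Delta Lake 1.2.0:
spark.sql("OPTIMIZE delta.`abfss://lake@datalakelrs001.dfs.core.windows.net/tables/sales`")

# Equivalent Python API in newer delta-spark releases:
DeltaTable.forPath(
    spark, "abfss://lake@datalakelrs001.dfs.core.windows.net/tables/sales"
).optimize().executeCompaction()
```

Pairing this with a periodic VACUUM cleans up the superseded small files and keeps storage costs in check as well (VACUUM defaults to a 7-day retention window).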
I was able to bring down the cost of the platform by 50-60% just by tackling the above-mentioned inefficiencies of the code/platform.
Share your thoughts if you have tackled cost optimization problems of your own!
This is my first ever article on any platform, and I would love to hear your feedback :)