A few months ago, I found a nice little open-source tool on GitHub called AZTK, which provides a fast and easy way to provision low-cost Apache Spark clusters on Microsoft Azure.
In this blog post, I would like to show you how to install the Azure Distributed Data Engineering Toolkit (AZTK) on your Windows, Linux, or macOS system, and how to provision your first Apache Spark cluster with it.
Azure Distributed Data Engineering Toolkit (AZTK)
The Azure Distributed Data Engineering Toolkit (AZTK) is a Python CLI application for provisioning on-demand Spark on Docker clusters in Azure. It’s a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale.
This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.
For more details, please have a look at the project on GitHub. Its main features are:
- Spark cluster provision time of 5 minutes on average
- Spark clusters run in Docker containers
- Run Spark on a GPU-enabled cluster
- Users can bring their own Docker image
- Ability to use low-priority VMs for an 80% discount
- Mixed Mode clusters that use both low-priority and dedicated VMs
- Built-in support for Azure Blob Storage and Azure Data Lake connections
- Tailored Pythonic experience with PySpark, Jupyter, and Anaconda
- Tailored R experience with SparklyR, RStudio-Server, and Tidyverse
- Ability to run spark submit directly from your local machine’s CLI
Install Python 3
Before you install the Azure Distributed Data Engineering Toolkit, you need Python 3, as well as pip3, installed on your system.
To do this, please have a look at Python.org.
Virtual Environment (optional)
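After that, I recommend creating a separate virtual environment for the toolkit, for example with Python’s built-in venv module.
Once you’ve created a virtual environment, you need to activate it.
On Windows, run the activate.bat script in the environment’s Scripts folder.
On Unix or macOS, source the activate script in the environment’s bin folder.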
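As a sketch for a Unix-like shell, assuming the environment lives in a folder named .venv (a name chosen here purely for illustration), creating and activating it could look like this:

```shell
# Create a virtual environment in the folder ".venv" (hypothetical name).
python3 -m venv .venv

# Activate it for the current shell session (Unix or macOS).
# "." is the portable spelling of "source".
. .venv/bin/activate

# Windows (cmd.exe) equivalent of the activation step:
#   .venv\Scripts\activate.bat

# The active environment's path is exposed in VIRTUAL_ENV.
echo "Active environment: $VIRTUAL_ENV"
```

While the environment is active, pip installs packages into it instead of into the system-wide Python installation.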
Now you’re ready to install the Azure Distributed Data Engineering Toolkit (AZTK) with a simple pip install aztk.
Once the toolkit is installed, you’re ready to create your first AZTK environment. To do that, you simply call aztk spark init.
This command creates a .aztk folder in your current directory, which holds the toolkit’s configuration files such as secrets.yaml.
If you want to create a machine-wide configuration, add the --global parameter to the command.
Azure Resources and Credentials
To be able to work with the toolkit, you have to provision a few Azure resources, e.g. an Azure Batch account, an Azure Storage account, and a service principal.
The AZTK repository on GitHub provides an account setup script for this. After you answer a few questions, the script returns the required settings, which you add to or update in the .aztk/secrets.yaml file.
Provision your first Apache Spark cluster
Finally we’re ready to provision our first Apache Spark cluster using the AZTK.
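A minimal sketch of such a create call is shown here. The cluster ID, node count, and VM size are placeholder values chosen for illustration, and flag spellings may differ between AZTK releases, so it is worth checking aztk spark cluster create --help for your version. The surrounding check only makes the snippet safe to paste on a machine where the toolkit is not installed yet.

```shell
# Placeholder values for illustration; adjust them to your needs.
CLUSTER_ID="my-first-cluster"   # must be unique within your Azure Batch account
DEDICATED_NODES=2               # dedicated VMs, charged at the full price
VM_SIZE="standard_d2_v2"        # assumed to be available in your Batch account's region

if command -v aztk >/dev/null 2>&1; then
    # To use low-priority VMs instead, replace --size with --size-low-priority.
    aztk spark cluster create \
        --id "$CLUSTER_ID" \
        --size "$DEDICATED_NODES" \
        --vm-size "$VM_SIZE"
else
    echo "aztk is not on the PATH; install it and activate its virtual environment first."
fi
```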
- With the --id parameter, you specify a unique ID (within your Azure Batch account) for your cluster.
- The --size parameter specifies the number of dedicated virtual machines (which are charged at the full price).
- The --size-low-priority parameter specifies the number of low-priority virtual machines (which are charged at 20% of the regular price). This, of course, comes with a disadvantage: if Azure needs the capacity for another customer, these virtual machines are taken away from your cluster.
- The vm-size parameter specifies the type of the virtual machines to use.
You’re also able to use the Azure N-Series virtual machines to provision GPU-enabled clusters.
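To make the pricing difference concrete, here is a small back-of-the-envelope calculation. The hourly price of 10 cents per virtual machine is a made-up number for illustration only; the 20% factor is the low-priority rate mentioned above.

```shell
# Hypothetical price of one dedicated VM in cents per hour (not a real Azure price).
PRICE_CENTS=10
# Low-priority VMs are charged at 20% of the regular price.
LOW_PRIORITY_PERCENT=20
NODES=4

# Integer arithmetic in cents avoids floating-point handling in the shell.
DEDICATED_COST=$((NODES * PRICE_CENTS))
LOW_PRIORITY_COST=$((NODES * PRICE_CENTS * LOW_PRIORITY_PERCENT / 100))

echo "${NODES} dedicated nodes:    ${DEDICATED_COST} cents per hour"
echo "${NODES} low-priority nodes: ${LOW_PRIORITY_COST} cents per hour"
```

With these assumed numbers, a four-node cluster costs 8 cents per hour on low-priority virtual machines instead of 40 cents on dedicated ones.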
Getting Cluster Information
While a cluster is being provisioned, is running, or is being deleted, you can list all of your clusters with aztk spark cluster list, or get the details of a single cluster with aztk spark cluster get.
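For example, assuming a cluster with the hypothetical ID my-first-cluster, the two calls could look like this (the check only keeps the snippet from failing where the toolkit is not installed):

```shell
CLUSTER_ID="my-first-cluster"   # hypothetical ID, replace with your own

if command -v aztk >/dev/null 2>&1; then
    # List all clusters in the configured Azure Batch account.
    aztk spark cluster list
    # Show the details of a single cluster.
    aztk spark cluster get --id "$CLUSTER_ID"
else
    echo "aztk is not on the PATH; install it first."
fi
```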
Connect to the cluster
With the aztk spark cluster ssh command, you can open an SSH connection to the master node of your cluster and forward the ports of the services (and plugins) running on the cluster to your local machine.
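A minimal example, again assuming the hypothetical cluster ID my-first-cluster:

```shell
CLUSTER_ID="my-first-cluster"   # hypothetical ID, replace with your own

if command -v aztk >/dev/null 2>&1; then
    # Opens an interactive SSH session to the master node of the cluster.
    aztk spark cluster ssh --id "$CLUSTER_ID"
else
    echo "aztk is not on the PATH; install it first."
fi
```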
After the connection has been established, you can use the port forwarding to access services like the Spark Web UI.
Deleting a cluster
Last but not least, don’t forget to delete the cluster once you no longer need it, since its virtual machines are billed for as long as they are running.
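A sketch of the delete call, assuming the cluster was created with the hypothetical ID my-first-cluster:

```shell
CLUSTER_ID="my-first-cluster"   # hypothetical ID, replace with your own

if command -v aztk >/dev/null 2>&1; then
    # Deletes the cluster and releases its virtual machines.
    aztk spark cluster delete --id "$CLUSTER_ID"
else
    echo "aztk is not on the PATH; nothing was deleted."
fi
```

Listing your clusters again afterwards is an easy way to confirm that nothing is left running.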
To see a demo of how to set up AZTK and provision your first Spark cluster, I created a short video: