Highlight
If your developers build notebooks directly in the Azure Databricks portal, you can quickly enhance their productivity by adding a simple CI/CD pipeline with Azure DevOps. In this article I'll show you how!
First of all, I want to explain two different approaches to developing notebooks in the Databricks portal. This article covers how to implement CI/CD for development done in the portal; if you are developing locally with PySpark, there are different approaches. I call this a simplistic approach that many projects can adopt without much change to their current ways of working.
The basics
Let's start with how the new git integration works. It is called 'Repos', and I say 'new' because there is an older feature, also called git integration, that versioned individual notebooks with git. I never liked it.
Documentation Reference: https://learn.microsoft.com/en-us/azure/databricks/repos/
Databricks Git Integration
Azure Databricks has the following folder structure:
Workspace
├───Users
│   ├── Adam
│   └── Tom
├───Shared
│   ├── X
│   └── Y
└───Repos
    ├── A
    ├── B
    ├── Adam
    └── Tom
Even though the Repos menu item in the Databricks portal sits on the same level as the workspace, everything you do in it is stored in the workspace, under the Repos subfolder.
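You can see this for yourself with the databricks-cli (the same tool the pipeline later in this article uses). A minimal check, assuming the CLI is already configured against your workspace:
# Repos checkouts show up as ordinary workspace paths under /Repos
databricks workspace ls /Repos
databricks workspace ls /Repos/Adam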
That said, Repos is the 'new' way of doing git integration in Azure Databricks. IMO, don't use the old 'per-file versioning with git'. The way I like to explain it is that 'Repos' is just a local development environment/machine hosted in the cloud (since you can't run Databricks locally). The only real way to develop locally is with PySpark and Hadoop and then deploy to Databricks, but that's a story for another day.
So let's look at Repos for a second. Take the structure below as an example of a Repos setup.
└───Repos
    ├── A - shared folder for purpose A
    ├── B - shared folder for purpose B
    ├── Adam - Adam's private virtual development environment/space
    └── Tom - Tom's private virtual development environment/space
Here we can either have shared folders that many people contribute to concurrently, or each person can have their own folder. So… which one is preferred?
Two options of working on the code
There are two schools of thought on how this is done right now (at least from what I have seen) when using Repos:
- Option 1 - like in ADF - create shared folders per branch and let multiple people work in them at the same time
- Option 2 - like in web dev - this is closer to what you see in web development, where developers have their own machines with folders checked out to branches and pull/push every time they implement changes
Both options have their merits but also small issues. Option 1 is simpler, as it 'feels' more like development in Data Factory, where multiple developers work on the same branch without needing to 'pull' the code constantly. This folder structure would look like this:
└───Repos
    ├── Feature/8236_dimCustomer - shared folder for feature 8236
    └── Feature/3232_factVolume - shared folder for feature 3232
Developers collaborate in those folders, and once everything works they only need to push to the Azure DevOps repo. Pulling is easy and there are practically no merge conflicts.
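For Option 1, someone first has to create that shared folder checked out to the feature branch. This is usually done from the Repos UI, but as a rough sketch it can also be scripted with the databricks-cli Repos commands (the repo URL and paths below are placeholders, and the parent folder must already exist):
# Create a Repos checkout of the Azure DevOps repo in a shared folder
databricks repos create \
  --url https://dev.azure.com/myorg/myproject/_git/myrepo \
  --provider azureDevOpsServices \
  --path /Repos/Feature/8236_dimCustomer

# Switch that folder to the feature branch
databricks repos update \
  --path /Repos/Feature/8236_dimCustomer \
  --branch Feature/8236_dimCustomer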
Option 2 is closer to web development, so the structure would look like this (let's assume Adam & Tom both work on the dimCustomer feature):
└───Repos
    ├── Adam
    │   └── Feature/8236_dimCustomer - Adam's folder for feature 8236
    └── Tom
        └── Feature/8236_dimCustomer - Tom's folder for feature 8236
When they implement changes, they need to pull/push code constantly to avoid conflicts, but it works like the classic web-development approach.
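In this setup, a 'pull' into a personal folder is just an update of that Repos checkout, done from the UI or, sketched again with the CLI (the path is a placeholder):
# Pull the latest Feature/8236_dimCustomer commits into Adam's personal checkout
databricks repos update \
  --path /Repos/Adam/Feature/8236_dimCustomer \
  --branch Feature/8236_dimCustomer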
One issue
Now, there is one issue with Databricks Repos in general, regardless of which option you go with: there is no way to LOCK a folder to a branch. By default it should look like this:
└───Repos
    ├── Main - Checked out to Main
    └── Feature/8236_dimCustomer - Checked out to Feature/8236_dimCustomer
But a user can perform a git checkout (using the UI) at any time, without any warning, and change the current working branch, ending up with a situation like this:
└───Repos
    ├── Main - Checked out to Feature/8236_dimCustomer
    └── Feature/8236_dimCustomer - Checked out to ABCD
Now, from the file-system perspective the folder is called Main, but the files in it really come from the Feature/8236_dimCustomer branch. Not only can developers mess up the code, they can also waste time hunting for issues before realizing they are on the wrong branch, because the folder is called Main.
Unfortunately there is no solution to this, but it leads me to one conclusion: Repos should be used only for 'development' of the code, not for releases and integration with external tools like Data Factory.
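There is no way to prevent the drift, but you can at least detect it, since the Repos API reports which branch a folder is actually checked out to. A quick check might look like this (a sketch, assuming a databricks-cli version that includes the repos commands):
# Prints the repo's url, path and the branch it is really checked out to
databricks repos get --path /Repos/Main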
CI/CD process
Because of the above, I've decided to build a CI/CD process which, for every branch and every push, automatically creates a current release folder in the workspace.
Option 1
Workspace
├───Releases
│   ├── Main - current main
│   └── Feature/8236_dimCustomer - current Feature/8236 code
└───Repos
    └── Feature/8236_dimCustomer - shared folder for feature 8236
Option 2
Workspace
├───Releases
│   ├── Main - current main
│   └── Feature/8236_dimCustomer - current Feature/8236 code
└───Repos
    ├── Adam
    │   └── Feature/8236_dimCustomer - Adam's folder for feature 8236
    └── Tom
        └── Feature/8236_dimCustomer - Tom's folder for feature 8236
So my CI/CD process doesn't care whether Adam or Tom pushed the code, and it doesn't care whether they shared a folder or not. It only cares what is in Azure DevOps and releases that to the workspace folder.
CI/CD pipeline
And this is done via the following YAML pipelines.
databricks-deploy-stage.yml - a generic, reusable template for all environments (dev/test/prod)
NOTE: Yes, I know there is an Azure Databricks extension in the marketplace, but I couldn't install it due to client policies, so I wrote a bash script instead.
parameters:
  - name: stageName
    type: string

stages:
  - stage: ${{ parameters.stageName }}
    variables:
      - group: ${{ parameters.stageName }}
    jobs:
      - deployment: ${{ parameters.stageName }}
        environment: '${{ parameters.stageName }}'
        strategy:
          runOnce:
            deploy:
              steps:
                - script: pip install databricks-cli
                  displayName: "install databricks-cli"
                - script: |
                    echo "$(databricksHost)
                    $(databricksToken)" | databricks configure --token
                  displayName: 'configure databricks-cli'
                - task: DownloadPipelineArtifact@2
                  inputs:
                    source: current
                    artifact: 'Databricks'
                    downloadPath: $(System.ArtifactsDirectory)/databricks
                - script: 'ls $(System.ArtifactsDirectory)/databricks'
                - script: |
                    BRANCH_NAME=$(echo "$(BranchName)" | awk -F/ '{print $NF}')
                    FOLDER=$(echo /$(Foldername)/$BRANCH_NAME)
                    echo $FOLDER
                    folder=$(databricks workspace ls --id $FOLDER)
                    if [[ "$folder" = Error* ]] ; then
                      echo "Folder $FOLDER not found. Skipping..."
                    else
                      echo "Deleting $FOLDER"
                      databricks workspace rm $FOLDER --recursive
                    fi
                  displayName: 'Delete old release'
                - script: |
                    BRANCH_NAME=$(echo "$(BranchName)" | awk -F/ '{print $NF}')
                    FOLDER=$(echo /$(Foldername)/$BRANCH_NAME)
                    echo $FOLDER
                    databricks workspace import_dir $(System.ArtifactsDirectory)/databricks $FOLDER --exclude-hidden-files
                  displayName: 'New release'
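One detail worth noting in the scripts above: BranchName is fed from $(Build.SourceBranch) (e.g. refs/heads/Feature/8236_dimCustomer), and the awk split keeps only the text after the last '/'. A quick local illustration of what ends up in $FOLDER, using the folderName value 'release' defined in the release pipelines below:
# Example inputs, as Azure DevOps would supply them
BranchName="refs/heads/Feature/8236_dimCustomer"
Foldername="release"

# Same parsing as in the 'Delete old release' and 'New release' steps
BRANCH_NAME=$(echo "$BranchName" | awk -F/ '{print $NF}')
FOLDER="/$Foldername/$BRANCH_NAME"
echo $FOLDER    # /release/8236_dimCustomer (and /release/main for the main branch)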
azure-pipelines-release-dev.yml - the actual build & release pipeline (all in one, your choice); it triggers on every change
trigger:
  branches:
    include:
      - '*'

pool:
  vmImage: ubuntu-latest

variables:
  - name: branchName
    value: $(Build.SourceBranch)
  - name: folderName
    value: release

stages:
  - stage: BUILD
    jobs:
      - job: BUILD
        steps:
          - task: PublishPipelineArtifact@1
            inputs:
              targetPath: '$(Build.Repository.LocalPath)/'
              artifact: 'Databricks'
              publishLocation: 'pipeline'

  - template: databricks-deploy-stage.yml
    parameters:
      stageName: DEV
The key part here is triggering on every change:
trigger:
  branches:
    include:
      - '*'
azure-pipelines-release-test-prod.yml - a manually triggered deployment to TEST and PROD
trigger:
  - none

pool:
  vmImage: ubuntu-latest

variables:
  - name: branchName
    value: $(Build.SourceBranch)
  - name: folderName
    value: release

stages:
  - stage: BUILD
    jobs:
      - job: BUILD
        steps:
          - task: PublishPipelineArtifact@1
            inputs:
              targetPath: '$(Build.Repository.LocalPath)/'
              artifact: 'Databricks'
              publishLocation: 'pipeline'

  - template: databricks-deploy-stage.yml
    parameters:
      stageName: TEST

  - template: databricks-deploy-stage.yml
    parameters:
      stageName: PROD
Branching policies
Now I just ensure that people create feature branches; I let them decide how they want to use Repos, whether shared folders or per-user ones. From my perspective, I only need the code to be pushed to the repo.
Also, in Data Factory I only use Workspace/Release/<branch_name>/<notebook_name> paths in the pipelines. The branch name is a global parameter in ADF that points to the current branch, and I override it via DevOps pipelines.
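As a sketch of what that looks like inside ADF (the global parameter name BranchName and the notebook name dimCustomer are just assumptions for this example), the notebook path in the Databricks Notebook activity becomes a dynamic expression along these lines:
@concat('/release/', pipeline().globalParameters.BranchName, '/dimCustomer')
Here BranchName would hold the same last-segment branch name the pipeline uses for the folder (e.g. 8236_dimCustomer or main), and the '/release/' segment should match the folderName variable in your pipelines.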
I only release ‘main’ to other environments.
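The pipelines above don't enforce this by themselves; if you want a guard, one option (my addition, not part of the original pipelines) is a small check at the top of the BUILD job in azure-pipelines-release-test-prod.yml that fails the run when it wasn't started from main:
# Extra step for the BUILD job in azure-pipelines-release-test-prod.yml
- script: |
    if [ "$(Build.SourceBranch)" != "refs/heads/main" ]; then
      echo "This pipeline should only be run from the main branch."
      exit 1
    fi
  displayName: 'Ensure release is from main'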
Summary
I hope this highlights a new and quick way to add CI/CD DevOps practices to your project team without a huge change to their current ways of working. In the future, I'd still advise learning to build solutions using PySpark for several extra advantages, but for today that's it. I hope it helps!