Highlight

If your developers are building notebooks directly in the Azure Databricks portal, you can quickly enhance their productivity by adding a simple CI/CD pipeline with Azure DevOps. In this article I’ll show you how!

First of all, I want to explain two different approaches to developing notebooks in the Databricks portal. This article covers CI/CD for development in the portal; if you are developing locally with PySpark, there are different approaches. I call this one a simplistic approach that many projects can adopt without much change to their current ways of working.

The basics

Let’s start with how the new git integration works. This new way is called ‘Repos’. I call it ‘new’ because there is an older feature, also called git integration, which versioned individual notebooks with git. I never liked that one.

Documentation Reference: https://learn.microsoft.com/en-us/azure/databricks/repos/

Databricks Git Integration

Azure Databricks has the following folder structure:

Workspace
├───Users
│   ├── Adam
│   └── Tom
├───Shared
│   ├── X
│   └── Y
└───Repos
    ├── A
    ├── B
    ├── Adam
    └── Tom

Even though in the Databricks portal the Repos menu item sits at the same level as the Workspace, everything you do in it is stored in the Workspace folder, under the Repos subfolder.
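
You can see this for yourself with the same databricks-cli that the deployment pipeline later in this article uses, assuming the CLI is already configured with your workspace URL and a token:

databricks workspace ls /        # shows Users, Shared and Repos at the workspace root
databricks workspace ls /Repos   # shows the folders created under Repos (A, B, Adam, Tom above)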

That said, Repos is the ‘new’ way of doing git integration in Azure Databricks. IMO, don’t use the old ‘per file versioning with git’. The way I like to explain it is that ‘Repos’ is just a local development environment/machine hosted in the cloud (since you can’t have Databricks running locally). The only real way to develop locally is with PySpark and Hadoop and then deploy to Databricks, but that’s a story for another day.
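
Each folder under Repos is simply a clone of a git repository that you create from the Repos menu in the portal. If you’d rather script that setup, the same thing can be done through the Repos API / databricks-cli. A hedged sketch, assuming a databricks-cli version that includes the repos command group; the URL and path are placeholders:

# Clone an Azure DevOps repo into a private folder under /Repos
databricks repos create \
  --url https://dev.azure.com/<org>/<project>/_git/<repo> \
  --provider azureDevOpsServices \
  --path /Repos/Adam/<repo>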

So let’s look at Repos for a second. Take the structure below as an example of a Repos setup.

└───Repos
    ├── A           - shared folder for purpose A
    ├── B           - shared folder for purpose B
    ├── Adam        - Adam's private virtual development environment/space
    └── Tom         - Tom's private virtual development environment/space

Here we can either have shared folders that many people contribute to concurrently, or each person can have their own folder. So… which one is preferred?

Two options for working on the code

There are two schools of thought on how this is done right now (at least from what I have seen) when using Repos:

  • Option 1 - like in ADF - create shared folders per branch and let multiple people work in them at the same time
  • Option 2 - like in web development - developers have their own folders checked out to branches (just as they would have their own machines) and pull/push every time they implement changes

Both options have their merits, but also small issues. Option 1 is simpler as it ‘feels’ more like development in Data Factory, where multiple developers work on the same branch without needing to ‘pull’ the code constantly. The folder structure would look like this:

└───Repos                                           
    ├── Feature/8236_dimCustomer       - shared folder for feature 8236
    └── Feature/3232_factVolume        - shared folder for feature 3232

Developers collaborate in those folders, and once everything works they only need to push to the Azure DevOps repo. Pulling is easy and, because everyone edits the same checked-out copy, there are no merge conflicts.

Option 2 is closer to web development, so the structure would look like this (let’s assume Adam & Tom both work on the dimCustomer feature):

└───Repos    
    ├── Adam       
    │   └── Feature/8236_dimCustomer        - Adam's folder for feature 8236
    └── Tom 
        └── Feature/8236_dimCustomer        - Tom's folder for feature 8236

So when they implement changes, they need to constantly pull/push code to avoid conflicts. But it works just like the classic web development approach.

One issue

Now, there is one issue with Databricks Repos in general, regardless of which option you go with: there is no way to LOCK a folder to a branch. By default it should look like this:

└───Repos
    ├── Main                            - Checked out to Main
    └── Feature/8236_dimCustomer        - Checked out to Feature/8236_dimCustomer

But a user can perform a git checkout (using the UI) at any time, without any warning, and change the current working branch, ending up with a situation like this:

└───Repos
    ├── Main                            - Checked out to Feature/8236_dimCustomer
    └── Feature/8236_dimCustomer        - Checked out to ABCD

Now, from the file system’s perspective the folder is called Main, but the files in it really come from the Feature/8236_dimCustomer branch. Not only can developers mess up the code, they can also waste time looking for issues before realizing they are on the wrong branch, because the folder is called Main.

Unfortunately, there is no way to prevent this, but one conclusion comes to my mind: Repos should be used only for ‘development’ of the code, not for releases or integration with external tools like Data Factory.
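
While you can’t lock a folder to a branch, you can at least detect when someone has switched it: the Repos API reports which branch every folder under /Repos is currently checked out to, so a periodic check (or a pipeline step) can flag drifted folders. A minimal sketch, assuming curl and jq are available; the workspace URL and PAT are placeholders:

# List every folder under /Repos and the branch it is currently checked out to
curl -s -H "Authorization: Bearer <databricks-pat>" \
  "https://<workspace-url>/api/2.0/repos" \
  | jq '.repos[] | {path, branch}'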

CI/CD process

Because of the above, I’ve designed a CI/CD process which, for every branch and every push, automatically creates a current release folder in the workspace.

Option 1

Workspace
├───Releases
│   ├── Main                              - current main
│   └── Feature/8236_dimCustomer          - current Feature/8236 code
└───Repos                                 
    └── Feature/8236_dimCustomer          - shared folder for feature 8236

Option 2

Workspace
├───Releases
│   ├── Main                              - current main
│   └── Feature/8236_dimCustomer          - current Feature/8236 code
└───Repos                                 
    ├── Adam                              
    │   └── Feature/8236_dimCustomer      - Adam's folder for feature 8236
    └── Tom                               
        └── Feature/8236_dimCustomer      - Tom's folder for feature 8236

So my CI/CD process doesn’t care whether Adam pushed the code or Tom did. It doesn’t care whether they shared a folder or not. It only cares about what is in Azure DevOps and releases that to the workspace folder.
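
In essence, each release boils down to two databricks-cli calls against the target workspace; the YAML below just wraps them (plus an existence check and the artifact download) in a reusable template. A simplified sketch, using the folderName variable value (‘release’), the last segment of the example branch, and a placeholder artifact path:

# Remove the previous release of this branch, if it exists
databricks workspace rm /release/8236_dimCustomer --recursive
# Import the repository contents from the build artifact as notebooks
databricks workspace import_dir ./databricks-artifact /release/8236_dimCustomer --exclude-hidden-files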

CI/CD pipeline

And this is done via the YAML pipelines below.

databricks-deploy-stage.yml - a generic, reusable template for all environments (dev/test/prod)

NOTE: Yes, I know there is an Azure Databricks extension in the marketplace, but I couldn’t install it due to client policies, so I wrote a bash script instead.

parameters:
- name: stageName 
  type: string 

stages:
- stage: ${{ parameters.stageName }}
  variables:
  - group: ${{ parameters.stageName }}
  jobs:
  - deployment: ${{ parameters.stageName }}
    environment: '${{ parameters.stageName }}'
    strategy:
     runOnce:
       deploy:
         steps:
         - script: pip install databricks-cli
           displayName: "install databricks-cli"
 
         - script: |
             echo "$(databricksHost)
             $(databricksToken)" | databricks configure --token
           displayName: 'configure databricks-cli'
 
         - task: DownloadPipelineArtifact@2
           inputs:
             source: current
             artifact: 'Databricks'
             downloadPath: $(System.ArtifactsDirectory)/databricks
 
         - script: 'ls $(System.ArtifactsDirectory)/databricks'
 
         - script: | 
             BRANCH_NAME=$(echo "$(BranchName)" | awk -F/ '{print $NF}')
             FOLDER=$(echo /$(Foldername)/$BRANCH_NAME)
             echo $FOLDER
             folder=$(databricks workspace ls --id $FOLDER)
             if [[ "$folder" = Error* ]] ; then
             echo "Folder $FOLDER not found. Skipping..."
             else
             echo "Deleting $FOLDER"
             databricks workspace rm $FOLDER --recursive
             fi
           displayName: 'Delete old release'
 
         - script: |
             BRANCH_NAME=$(echo "$(BranchName)" | awk -F/ '{print $NF}')
             FOLDER=$(echo /$(Foldername)/$BRANCH_NAME)
             echo $FOLDER
             databricks workspace import_dir $(System.ArtifactsDirectory)/databricks $FOLDER --exclude-hidden-files
           displayName: 'New release'
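
The BranchName and Foldername variables this template references come from the calling pipeline (see below), while databricksHost and databricksToken are expected to come from a variable group named after the stage (DEV, TEST, PROD). You can create that group in the Pipelines Library UI, or script it; a hedged sketch with the Azure DevOps CLI, assuming the azure-devops extension is installed (organization, project and values are placeholders):

az extension add --name azure-devops
az devops configure --defaults organization=https://dev.azure.com/<org> project=<project>
# One variable group per stage; the template picks it up by stage name
az pipelines variable-group create --name DEV \
  --variables databricksHost="https://adb-<workspace-id>.<n>.azuredatabricks.net"
# Add the PAT as a secret variable (use the group id returned by the previous command)
az pipelines variable-group variable create --group-id <group-id> \
  --name databricksToken --value <databricks-pat> --secret true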

azure-pipelines-release-dev.yml - the actual build & release pipeline (all in one, but that’s your choice); it triggers on every change

trigger:
 branches:
   include:
     - '*'

pool:
  vmImage: ubuntu-latest

variables:
- name: branchName
  value: $(Build.SourceBranch)
- name: folderName
  value: release

stages:
- stage: BUILD
  jobs:
  - job: BUILD
    steps:
    - task: PublishPipelineArtifact@1
      inputs:
        targetPath: '$(Build.Repository.LocalPath)/'
        artifact: 'Databricks'
        publishLocation: 'pipeline'

- template: databricks-deploy-stage.yml
  parameters: 
      stageName: DEV

The key part here is triggering on every change:

trigger:
 branches:
   include:
     - '*'
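
One more detail worth noting: branchName holds the full ref (for example refs/heads/Feature/8236_dimCustomer), and the awk call in the deploy template keeps only the last path segment, which becomes the release folder name. A quick illustration using the example branch from this article:

# What the template does with the branch variable
BRANCH_NAME=$(echo "refs/heads/Feature/8236_dimCustomer" | awk -F/ '{print $NF}')
echo $BRANCH_NAME   # -> 8236_dimCustomer
FOLDER=/release/$BRANCH_NAME
echo $FOLDER        # -> /release/8236_dimCustomer

Because only the last segment is used, branch names need distinct last segments to avoid two branches deploying to the same release folder.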

azure-pipelines-release-test-prod.yml - manually triggered deployment to TEST and PROD

trigger:
- none

pool:
  vmImage: ubuntu-latest

variables:
- name: branchName
  value: $(Build.SourceBranch)
- name: folderName
  value: release

stages:
- stage: BUILD
  jobs:
  - job: BUILD
    steps:
    - task: PublishPipelineArtifact@1
      inputs:
        targetPath: '$(Build.Repository.LocalPath)/'
        artifact: 'Databricks'
        publishLocation: 'pipeline'

- template: databricks-deploy-stage.yml
  parameters: 
      stageName: TEST

- template: databricks-deploy-stage.yml
  parameters: 
      stageName: PROD

Branching policies

Now I just ensure that people create feature branches; I let them decide how they want to use Repos, whether with shared folders or per user. From my perspective, I only need the code to be pushed to the repo.

Also, in Data Factory I only reference Workspace/Release/<branch_name>/<notebook_name> in the pipelines. The branch name is a global parameter in ADF which points to the current branch, and I override it via DevOps pipelines.

I only release ‘main’ to other environments.
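
If you prefer to queue those manual TEST/PROD runs from the command line rather than the portal, the Azure DevOps CLI can do it; the pipeline name below is a placeholder for whatever name you registered the YAML under:

# Queue the manual release pipeline against main
az pipelines run --name "azure-pipelines-release-test-prod" --branch main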

Summary

I hope this highlights a new and quick way to add CI/CD DevOps practices to your project team without a huge change to their current ways of working. In the future, I’d still advise learning to build solutions with PySpark for several extra advantages, but for today that’s it. I hope it helps!

Adam Marczak

