Real-Time Data Processing with Unity Catalog and CI/CD in Azure Databricks

Azure Databricks Project Setup and Automation

Project Overview

This project involves setting up an Azure Databricks environment, integrating it with Azure storage accounts, automating data processing workflows, and implementing CI/CD pipelines to ensure seamless integration and deployment of data and notebooks.

Steps and Implementation

1. Azure Resource Group Creation

An Azure Resource Group was created to organize and manage all related resources.

2. Storage Accounts Setup

Two storage accounts were created to store and manage the project data.
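
Steps 1 and 2 were performed in the Azure portal; for reference, the sketch below shows roughly the same provisioning with the Azure SDK for Python. The resource group name, region, and SKU are illustrative placeholders (only projectstgaccount is named in this write-up), and the hierarchical namespace is enabled so the account behaves as ADLS Gen2.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountCreateParameters, Sku

subscription_id = "<subscription-id>"          # placeholder
credential = DefaultAzureCredential()

# Step 1: resource group (name and region are illustrative).
rg_client = ResourceManagementClient(credential, subscription_id)
rg_client.resource_groups.create_or_update(
    "databricks-project-rg", {"location": "eastus"}
)

# Step 2: storage account with hierarchical namespace (ADLS Gen2).
stg_client = StorageManagementClient(credential, subscription_id)
poller = stg_client.storage_accounts.begin_create(
    "databricks-project-rg",
    "projectstgaccount",                        # account named in this write-up
    StorageAccountCreateParameters(
        location="eastus",
        sku=Sku(name="Standard_LRS"),
        kind="StorageV2",
        is_hns_enabled=True,                    # enables ADLS Gen2 behaviour
    ),
)
poller.result()
```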

3. Container Configuration

Within the projectstgaccount storage account, three containers were created, with the landing container designated for storing raw data.

4. Medallion Folder Structure

Three folders were created following the medallion structure (conventionally the bronze, silver, and gold layers) to organize data systematically.
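
A minimal notebook sketch of the same layout, runnable once the workspace and storage access from steps 5–8 are in place. The container name is a placeholder (the write-up does not say which container holds these folders), and bronze/silver/gold follows the usual medallion naming.

```python
# Placeholder container/account; adjust to the container that actually holds
# the medallion folders. bronze/silver/gold is the conventional layer naming.
account = "projectstgaccount"
container = "medallion"  # placeholder -- not named in the write-up

base = f"abfss://{container}@{account}.dfs.core.windows.net"

for layer in ["bronze", "silver", "gold"]:
    dbutils.fs.mkdirs(f"{base}/{layer}")

display(dbutils.fs.ls(base))
```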

5. Azure Databricks Workspace Setup

An Azure Databricks workspace was established to facilitate data processing and analysis.

6. Databricks Access Connector

A Databricks access connector was created and assigned the Storage Blob Data Contributor role on the two storage accounts, ensuring secure data access.


7. Databricks Metastore and Catalog

Within the Azure Databricks workspace, a Unity Catalog metastore was created and attached to the workspace. Subsequently, a development (dev) catalog was set up.

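The metastore itself is created from the Databricks account console, so only the catalog step lends itself to a code sketch. A minimal version via spark.sql is shown below; the catalog name dev matches the dev catalog referenced later in this write-up.

```python
# Create the development catalog (the metastore is already attached to the
# workspace at this point) and make it the default for the session.
spark.sql("CREATE CATALOG IF NOT EXISTS dev COMMENT 'Development catalog'")
spark.sql("USE CATALOG dev")

display(spark.sql("SHOW CATALOGS"))
```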

8. Storage Credentials and External Locations

Storage credentials and external locations were configured to manage data access and storage.
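
A hedged sketch of this step as SQL run from a notebook. Creating the storage credential from the access connector is typically done in Catalog Explorer, so only the external location is shown; landing_ext and project_credential are placeholder names, while the landing container and projectstgaccount come from the steps above.

```python
# External location over the landing container, backed by the storage
# credential created from the access connector (placeholder names).
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS landing_ext
    URL 'abfss://landing@projectstgaccount.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL project_credential)
""")

# Quick check that the credential resolves and the path is readable.
display(dbutils.fs.ls("abfss://landing@projectstgaccount.dfs.core.windows.net/"))
```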

9. Cluster Creation

A Databricks cluster was created to execute data processing tasks.

10. File Verification

All provided notebooks were run manually to verify that paths and variable names were correctly defined.

All schemas were created in the dev catalog.
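
A small verification sketch in the same spirit: create the expected layer schemas idempotently and confirm they exist in the dev catalog. The schema names assume the bronze/silver/gold layers used elsewhere in this write-up.

```python
# Idempotent schema creation plus a listing to confirm what exists in dev.
for schema in ["bronze", "silver", "gold"]:
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS dev.{schema}")

display(spark.sql("SHOW SCHEMAS IN dev"))
```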

11. Autoscaling and Workflow Creation

Autoscaling was enabled, and workflows were created to automate the execution of data processing tasks.
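
The workflows here were built in the Jobs UI; the sketch below shows an equivalent job definition with an autoscaling job cluster, in the shape accepted by the Databricks Jobs API 2.1. The workspace URL, token, notebook path, node type, and runtime version are placeholders.

```python
import requests

host = "https://<workspace>.azuredatabricks.net"   # placeholder
token = "<personal-access-token>"                  # placeholder

job_spec = {
    "name": "medallion-load",
    "job_clusters": [{
        "job_cluster_key": "autoscaling_cluster",
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "autoscale": {"min_workers": 1, "max_workers": 4},
        },
    }],
    "tasks": [{
        "task_key": "load_bronze",
        "job_cluster_key": "autoscaling_cluster",
        "notebook_task": {"notebook_path": "/Notebooks/load_bronze"},
    }],
}

resp = requests.post(
    f"{host}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {token}"},
    json=job_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new job_id
```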

12. Dbutils Widgets

Keys and default parameters for dbutils widgets were created so that notebooks can handle dynamic configurations at run time.
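
A minimal widget sketch; the widget names and defaults below are illustrative rather than the exact keys used in the project notebooks.

```python
# Define widgets (name, default value, label) and read them back as parameters.
dbutils.widgets.text("env", "dev", "Environment")
dbutils.widgets.text("storage_account", "projectstgaccount", "Storage account")

env = dbutils.widgets.get("env")
storage_account = dbutils.widgets.get("storage_account")

landing_path = f"abfss://landing@{storage_account}.dfs.core.windows.net/"
print(env, landing_path)
```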

13. Trigger Creation and Incremental Data Processing

Triggers were created to automate task execution. Multiple triggers were cloned to manage different data streams, such as raw roads and raw traffic. New files added to Azure Data Lake Storage (ADLS) initiate the triggers, ensuring incremental data processing and successful job completion.
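
The sketch below shows the incremental pattern this step relies on, using Auto Loader with Spark Structured Streaming so that each triggered run picks up only newly arrived files. The paths, file format, and table name are placeholders; raw_traffic mirrors the stream names mentioned above.

```python
# Placeholder paths; raw files arrive in the landing container, while the
# checkpoint folder tracks schema and which files were already processed.
landing_path = "abfss://landing@projectstgaccount.dfs.core.windows.net/raw_traffic/"
checkpoint_path = "abfss://landing@projectstgaccount.dfs.core.windows.net/_checkpoints/raw_traffic/"

raw_traffic = (
    spark.readStream.format("cloudFiles")          # Auto Loader
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", checkpoint_path)
    .option("header", "true")
    .load(landing_path)
)

(
    raw_traffic.writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                    # process new files, then stop
    .toTable("dev.bronze.raw_traffic")             # Delta table in the dev catalog
)
```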

14. Data Reporting

Processed data was integrated with Power BI for comprehensive reporting and analysis.

15. CI/CD Pipeline Setup

A CI/CD pipeline was established to automate the deployment process. On each push to the main branch, all folders are copied to the live folder, which requires admin access to modify directly. This setup ensures seamless integration and deployment of all notebooks across environments, keeping the live folder updated with the latest changes.

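One way the copy-to-live step can be automated is to keep the live folder as a Databricks Repo and fast-forward it to the latest main from the release stage via the Repos API; the sketch below assumes that approach and is not necessarily how this project's pipeline is wired. Host, token, and repo ID are placeholders.

```python
import requests

host = "https://<workspace>.azuredatabricks.net"   # placeholder
token = "<service-principal-or-admin-token>"       # placeholder
repo_id = "<live-folder-repo-id>"                  # placeholder

# Update the "live" checkout to the tip of main after a successful push.
resp = requests.patch(
    f"{host}/api/2.0/repos/{repo_id}",
    headers={"Authorization": f"Bearer {token}"},
    json={"branch": "main"},
)
resp.raise_for_status()
print(resp.json())
```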

Conclusion

This project demonstrates the efficient setup and automation of an Azure Databricks environment. It includes secure data integration, automated workflows, and comprehensive reporting, enhanced by a robust CI/CD pipeline to ensure consistent and up-to-date data deployment across different environments. This approach facilitates seamless integration, deployment, and data accessibility while maintaining data integrity and security.

At the end of the project, the workspace includes the main branch with all changes already pulled and all files organized in a Notebooks folder, so every project component is easily accessible and well structured. A release pipeline was also created to move objects from the development catalog to the UAT catalog, requiring admin approval; the Azure DevOps interface shows the deployment stages, ensuring a controlled and authorized promotion between these environments.

About

Implemented Azure Databricks for real-time data processing and governance using Unity Catalog, Spark Structured Streaming, Delta Lake features, Medallion Architecture, and end-to-end CI/CD pipelines. Focused on incremental loading, compute cluster management, maintaining data quality, and creating workflows.
