Sales Insights - Architecture
Aim: Build a small Azure data platform that could be used by a hypothetical business to ingest, store and visualise business data. Following this, I’ll look to apply some further intelligence to their data in the form of customer segmentation, sales forecasting or natural language processing on customer reviews.
Design & Approach

Blob Storage
Acts as the file landing zone where raw CSV files can be dropped and stored.
Azure Data Factory
Used to ingest and process files from Blob Storage into the SQL Database.
SQL Database
Database set up in a medallion design to store, clean and enrich data.
Tableau
Reporting tool to access and visualise the "Gold" layer data within the database.
Azure Architecture
As the aim of this project is to keep it simple and cost effective, the underlying architecture is also quite straightforward, as seen in the image below.



Azure Subscription & Resource Groups
Subscriptions and resource groups can largely be covered by a Netflix analogy. There’s the bill-payer account (the subscription), and then all the freeloading user accounts underneath (the resource groups). Whilst one entity owns and pays for the Netflix subscription, there are multiple accounts within it being used. Just instead of rewatching TV series, people use resource groups for building VMs, storing data, running data science pipelines and so on.
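To make the hierarchy concrete, here’s a minimal sketch of creating a resource group under a subscription with the Azure SDK for Python. The subscription ID, group name and region are all hypothetical placeholders.

```python
# A minimal sketch: a resource group sits inside a subscription, like a
# profile inside the Netflix account. All names here are hypothetical.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

credential = DefaultAzureCredential()  # picks up your local Azure login
client = ResourceManagementClient(credential, "<subscription-id>")

# Create (or update) a resource group in the subscription.
client.resource_groups.create_or_update(
    "rg-sales-insights",      # hypothetical resource group name
    {"location": "uksouth"},  # hypothetical region
)
```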

Blob Storage
Similar to the storage on your laptop, this is multipurpose file storage for all kinds of file types and data. In this use case, it’ll act as a hypothetical landing zone where files are dropped for processing, giving Azure Data Factory an easy place to find them.
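Dropping a file into the landing zone is a one-liner with the azure-storage-blob package. A minimal sketch below, where the connection string, container name and file name are hypothetical.

```python
# A minimal sketch of dropping a raw CSV into the Blob Storage landing
# zone. Connection string, container and file names are hypothetical.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
blob = service.get_blob_client(container="landing", blob="sales_2024.csv")

with open("sales_2024.csv", "rb") as data:
    blob.upload_blob(data, overwrite=True)  # lands the file for ADF to pick up
```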

Azure Data Factory (ADF)
Used for processing data, and lots of it. ADF provides access to all the benefits of the cloud (scalable compute, parallel processing, speed and performance etc.) in a ‘simple’ interface where users can build data pipelines for moving and processing data.

In this use case, I built a metadata-driven pipeline that looks in the ‘Landing Folder’ for CSV files and runs an ingestion pipeline for each one, moving the data from the CSV files into the SQL database. The benefit of utilising metadata is that rather than building multiple pipelines, you build one template pipeline, and the metadata dynamically runs a version of that template for each variation of the metadata, in this case the file name. Building this out further, triggers can be added to run the pipelines based on events, such as files landing in blob storage, with the pipeline then relocating the files once the data has been transferred into the database.
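In ADF itself this is built from Get Metadata and ForEach activities, but the pattern is easy to show in a short Python sketch: the ‘metadata’ is just the list of file names in the landing container, and one parameterised ingestion routine runs per file. The container name and the `ingest` helper are hypothetical stand-ins for the real pipeline.

```python
# A sketch of the metadata-driven pattern outside ADF: list the CSV
# files in the landing container, then run one parameterised ingestion
# per file. Names and the ingest() helper are hypothetical.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
landing = service.get_container_client("landing")

def ingest(file_name: str) -> None:
    """Placeholder for the template pipeline: copy one CSV into the
    Bronze schema of the SQL database."""
    print(f"Ingesting {file_name} into Bronze...")

# One template, run once per file, rather than one pipeline per file.
for item in landing.list_blobs():
    if item.name.lower().endswith(".csv"):
        ingest(item.name)
```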



SQL Database
The core of the solution and where the data sits. It uses a medallion architecture, a design pattern based on the data improving as it moves through Bronze, Silver and Gold stages. The data lands from ADF into the Bronze schema, where data types may be incorrect, values may be null or missing, and overall the data needs some work. An initial data quality stored procedure in the Bronze schema carries out some basic processes such as deduplication. Following this, the data is promoted to the Silver schema using another stored procedure that continues improving the data by correcting data types and handling missing values. Further enrichment occurs in this layer before the data moves into the final Gold layer, which is made accessible to data consumers; in this case we’ll start with a Tableau reporting dashboard.
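To give a flavour of the Bronze-to-Silver promotion, here’s a minimal sketch of the kind of T-SQL one of those stored procedures might contain, executed from Python via pyodbc. The `bronze.sales` and `silver.sales` tables, their columns and the connection string are all hypothetical.

```python
# A minimal sketch of a Bronze-to-Silver promotion: correct data types
# and handle missing values on the way through. Tables, columns and
# connection string are hypothetical.
import pyodbc

PROMOTE_TO_SILVER = """
INSERT INTO silver.sales (order_id, order_date, amount)
SELECT
    CAST(order_id AS INT),                          -- correct the data type
    TRY_CAST(order_date AS DATE),                   -- invalid dates become NULL
    COALESCE(TRY_CAST(amount AS DECIMAL(10,2)), 0)  -- handle missing values
FROM bronze.sales
WHERE order_id IS NOT NULL;                         -- drop rows with no key
"""

with pyodbc.connect("<connection-string>") as conn:
    conn.execute(PROMOTE_TO_SILVER)
    conn.commit()
```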

Outcome:
For next to no cost, a simple data platform was built making use of Azure’s free offerings. The database and blob storage are free within usage limits, Data Factory costs a negligible amount for compute, and the only real investment is a little time to set everything up. This gives a solid foundation to continue building on, and quickly provides a robust data solution that could replace the common Excel reports seen in businesses. In the next phase I’ll build a Tableau report on top of this new database to provide insights into the data.