Neobanks and Workflows — Ingestion (Design)

Noel Angelo Borneo
5 min read · Sep 13, 2023


Hello! Welcome to my blog series, Neobanks and Workflows! In this series, I write about what I’ve learned in my data engineering journey and apply it to a topic I’m passionate about: fintech 💵.

From left to right: the Up Bank logo is credited to Up Bank; the MinIO, Delta and Apache Airflow logos are credited to @Chu Quang Bach from Excalidraw

Objective

In this three-part series, we will dive into the first step of most data engineering projects, data ingestion, and examine how we can ingest data from Up, a well-known neobank operating in Australia.

  1. Design (this article)
  2. Environment Setup (Docker, MinIO, Delta, Airflow)
  3. Build (Jupyter, DAGs, API requests)

By the end of this series, we can expect the following:

  • a locally working data pipeline (we keep the setup local because it’s easier to manage financial data on our own machine than to store it in the cloud)
  • an understanding of, and hands-on exposure to, popular open-source tools

The goal of this series is to provide simple analogies for the design of an ingestion pattern and to learn the concepts we cover through hands-on development.

Context

Let’s start with a few basic data engineering ideas that will frame the rest of this article: the data source, the staging directory, and the landing zone.

Design pattern for ingesting Up bank data (Image by Author)

Data Source — Up Bank

A data source is the place where we get the data we need. In our case, it is an API provided by Up that lets us collect transaction data for analysis or other purposes. This will be our starting point.

Sample transaction data retrieved from the API

further context (API): Imagine that we’re in a restaurant and our “data” is the food we’d like to order. Think of an API as the waiter: it takes our order (the request) and brings back our food (the response body) along with the receipt (the response metadata). I will dive deeper into APIs in a separate article.

Q: Why Up? — my personal opinion is that Up is the only neobank in Australia that offers a mature and reliable API service alongside its ability to categorise purchases. It feels like the product was built with developers in mind, much like Stripe. More information on the API specification can be found here.
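To make the restaurant analogy concrete, here is a minimal sketch of what placing our “order” could look like in Python. It assumes a personal access token stored in an UP_API_TOKEN environment variable; the endpoint and parameter names follow the public API spec linked above, and we’ll build this out properly in the Build article.

```python
# Minimal sketch of pulling transactions from Up's REST API.
# Assumes a personal access token in the UP_API_TOKEN environment
# variable; see the API spec for the full list of endpoints and params.
import os
import requests

BASE_URL = "https://api.up.com.au/api/v1"

def fetch_transactions(page_size: int = 100) -> list[dict]:
    """Pull a single page of transactions from the /transactions endpoint."""
    headers = {"Authorization": f"Bearer {os.environ['UP_API_TOKEN']}"}
    params = {"page[size]": page_size}
    response = requests.get(f"{BASE_URL}/transactions", headers=headers, params=params)
    response.raise_for_status()   # fail loudly on bad credentials or server errors
    payload = response.json()
    return payload["data"]        # the response body holds the records under "data"

if __name__ == "__main__":
    for tx in fetch_transactions(page_size=5):
        print(tx["id"])
```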

Staging Directory — MinIO

The data we get from Up’s API is in a format called JavaScript Object Notation or JSON. To store this kind of data, one option is to use a storage system called an “object store.” MinIO will be our provider for this, and it’s similar to services like AWS S3 or Azure Blob Storage.

further context (object store): Think of this as similar to popular tools like Dropbox or OneDrive, which let us upload and store most types of files in the cloud. The files we upload will all be placed in a single directory, hence the name staging “directory”.

A great article explaining this in more detail can be found here.

Q: Why do we need this? — think of it as the jet bridge at the airport, where our JSON data (the passengers) boards before being seated (organised into rows and columns). We’ll use MinIO to create a local setup that imitates a cloud object store.
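Here’s a rough sketch of how dropping those JSON “passengers” onto the jet bridge could look with the MinIO Python client. It assumes MinIO is running locally on port 9000 with the default development credentials shown; the bucket and object names are purely illustrative.

```python
# Minimal sketch of landing raw JSON responses in a MinIO bucket.
# Assumes a local MinIO server with default dev credentials;
# bucket and object names are illustrative.
import io
import json
from datetime import date

from minio import Minio

client = Minio(
    "localhost:9000",
    access_key="minioadmin",   # assumed local dev credentials
    secret_key="minioadmin",
    secure=False,              # plain HTTP for a local setup
)

def stage_json(records: list[dict], bucket: str = "staging") -> str:
    """Write a batch of API responses into the staging 'directory' as one JSON object."""
    if not client.bucket_exists(bucket):
        client.make_bucket(bucket)
    body = json.dumps(records).encode("utf-8")
    object_name = f"up/transactions/{date.today().isoformat()}.json"
    client.put_object(bucket, object_name, io.BytesIO(body), length=len(body),
                      content_type="application/json")
    return object_name
```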

Landing Zone — Delta

Once the data has reached our staging directory, we can take the information stored in those files and structure it into a tabular format with Delta. The tables generated in this step mark the completion of our ingestion!

further context (bronze layer): this part of our pipeline comes from a data design pattern called the medallion architecture, coined by Databricks.

Another article worth reading which goes through this in more detail can be found here.

Q: Can’t we just use a traditional database, why do we need this? — yes, we can! A traditional database is a perfectly valid alternative for storing the data we’re collecting. I’ll write a separate post on that, but here I’m focusing on the approach that’s more common in the projects I’m exposed to: the data lakehouse architecture.
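As a preview of the landing zone step, a bronze ingestion with PySpark could look roughly like the sketch below. It assumes a SparkSession already configured with the Delta Lake and s3a (MinIO) packages, which we’ll take care of in the environment setup article; the paths are illustrative.

```python
# Minimal sketch of turning staged JSON into a bronze Delta table.
# Assumes a SparkSession configured with the Delta Lake and s3a
# (MinIO) packages; paths are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("up-ingestion").getOrCreate()

def ingest_bronze(staged_path: str, bronze_path: str) -> None:
    """Read raw JSON from the staging directory and append it to a bronze Delta table."""
    raw = spark.read.json(staged_path)             # e.g. "s3a://staging/up/transactions/*.json"
    bronze = raw.withColumn("_ingested_at", F.current_timestamp())
    (bronze.write
           .format("delta")
           .mode("append")
           .save(bronze_path))                     # e.g. "s3a://lake/bronze/up_transactions"
```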

Processing

We will collect our data from the API using a batch processing approach, where the tools we employ collect the data on a regular schedule, in our case daily. This ingestion approach is “pull” based, as we’re pulling transactions from our bank through their API. The workflow is shown below; let’s go through it step by step.

Workflow of our Ingestion pattern

Extraction

We will build our own Python scripts responsible for pulling the data through the REST API. There’s good documentation in the API spec on how to use it, and we will go through that in the Build phase of this series.

Collection

Once we’ve extracted the data, that is, taken the responses from the API, we will store them in our object store as JSON objects.

Ingestion

The final step in our workflow is to use Spark to create our own Delta tables. The tables we create in this step are what we call bronze tables. We can read more about this in the medallion architecture article found here.

One constraint worth raising is that our approach delivers data more slowly than the alternative, stream processing, where data lands in our storage in near real-time. The trade-off is complexity: a batch pipeline is simpler to build and typically has less overhead. And since batch jobs need to run on a schedule, this is where our orchestrator, Airflow, comes into play.
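To give a feel for how Airflow ties the three steps together, here is a rough sketch of a daily DAG (assuming Airflow 2.4 or later). The task callables are placeholders for the scripts we’ll actually write in the Build article, and the DAG id is illustrative.

```python
# Minimal sketch of the daily, pull-based workflow as an Airflow DAG.
# The task callables (extract, collect, ingest) stand in for the
# scripts we'll write later; names here are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract(): ...   # call the Up API
def collect(): ...   # stage the JSON in MinIO
def ingest():  ...   # write the bronze Delta table with Spark

with DAG(
    dag_id="up_bank_ingestion",
    schedule="@daily",                 # batch: one pull per day
    start_date=datetime(2023, 9, 1),
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    collect_task = PythonOperator(task_id="collect", python_callable=collect)
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)

    extract_task >> collect_task >> ingest_task
```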

TLDR: our ingestion process is done by pulling data from the API every day.

More information on these approaches can be found here.

Summary

In this article, we’ve taken a quick tour of the components that we’ll be working with in this project. We’ve also discussed new concepts and some of the reasoning behind the decisions made during the project’s development. In the next part of this series, we’ll dive into the steps for setting up our local environment and get started on building our very own data pipeline.

Now let’s set up our workbench and build some pipes for our data to flow through!
