# 5. Boring data pipeline

Date: 2020-03-25
Driver: Wisen Tanasa

# Status

Accepted

# Context

We need a base data set in production to make our product usable. We have multiple data sets already on hand, but most of them need to be analyzed, sanitized, and combined where necessary.

The data sets are not easily combined because there is no unique field that can be used to match buildings across them. A building in one data set might have a postcode of E8 1FT, while the same building in another data set might have a postcode of E8 2XY. The address fields do not match exactly either.
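
To make the mismatch concrete, here is a minimal sketch of why an exact key join fails and why matching has to fall back to fuzzy address comparison. The language (TypeScript), record shapes, normalization rules, and threshold are all illustrative assumptions, not a description of our actual data or code:

```typescript
// Illustrative only: these record shapes and the similarity helper are
// assumptions, not the real schemas of our data sets.
interface BuildingRecord {
  address: string;
  postcode: string;
}

// An exact join on postcode would miss this pair, even though both rows
// describe the same building.
const fromDataSetA: BuildingRecord = { address: "1 Example House, Mare St", postcode: "E8 1FT" };
const fromDataSetB: BuildingRecord = { address: "Example House, 1 Mare Street", postcode: "E8 2XY" };

// A matcher therefore has to fall back to fuzzy comparison of the address
// fields (normalization plus a similarity score) rather than a unique key.
const normalize = (s: string): string =>
  s.toLowerCase().replace(/[^a-z0-9 ]/g, "").replace(/\bstreet\b/g, "st").trim();

const looksLikeSameBuilding = (a: BuildingRecord, b: BuildingRecord): boolean => {
  const tokensA = new Set(normalize(a.address).split(/\s+/));
  const tokensB = new Set(normalize(b.address).split(/\s+/));
  const shared = [...tokensA].filter((t) => tokensB.has(t)).length;
  return shared / Math.max(tokensA.size, tokensB.size) >= 0.6; // threshold chosen arbitrarily here
};

console.log(looksLikeSameBuilding(fromDataSetA, fromDataSetB)); // matches despite the differing postcodes
```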

This requirement calls for a decision on which technologies and architectural patterns we want to use. With the advent of big data technologies, there are many well-known technologies and patterns that we might be able to leverage, including AWS services such as AWS Glue and Athena. These technologies might also influence our choice of programming language; AWS Glue, for example, only supports Python and Scala.

We also understand that all of our data sets are small; most of them are less than 1 GB.

# Decision

We will proudly go boring and implement the data pipeline by:

  • Sticking with the technology and language we already use
  • Involving manual intervention when the data sets are too complex to combine
  • Not fully automating the pipeline, as we don't need to run it very frequently yet; therefore no AWS Glue and the like
  • Not investing time in big data technologies

# Consequences

We're avoiding big data envy altogether by implementing this on our own. This will allow us to spend our time learning about the data rather than the technology.

To mitigate a possible rewrite to Python in the future, we will adopt the ETL patterns used by AWS Glue and make sure that our codebase reflects them. We will split our classes into Extractors, Transformers, and Loaders, and share reusable schemas or helpers within Data Sources.
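
As a rough sketch of what that split could look like (TypeScript again, purely illustrative; the interface names and `runPipeline` helper below are assumptions rather than our actual codebase):

```typescript
// Hypothetical shape of the pipeline, mirroring the Glue/ETL vocabulary.
interface Extractor<Raw> {
  extract(): Promise<Raw[]>;
}

interface Transformer<Raw, Clean> {
  transform(rows: Raw[]): Clean[];
}

interface Loader<Clean> {
  load(rows: Clean[]): Promise<void>;
}

// A pipeline is just the three stages composed in order, so moving to AWS
// Glue later would mean re-expressing each stage, not untangling one
// monolithic script.
const runPipeline = async <Raw, Clean>(
  extractor: Extractor<Raw>,
  transformer: Transformer<Raw, Clean>,
  loader: Loader<Clean>,
): Promise<void> => {
  const raw = await extractor.extract();
  const clean = transformer.transform(raw);
  await loader.load(clean);
};
```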

We will unit test the most complex part of our pipeline: the algorithm that matches buildings across multiple data sets. The unit tests will capture the specifications we grow over time as we learn the data.
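
For example, each lesson we learn about the data could become an executable specification. This is illustrative TypeScript assuming a Jest-style test runner and a hypothetical `matchBuildings` helper, neither of which is defined in this ADR:

```typescript
import { matchBuildings } from "./matchBuildings"; // hypothetical module

describe("building matching", () => {
  it("matches the same building even when the postcodes disagree", () => {
    const a = { address: "1 Example House, Mare St", postcode: "E8 1FT" };
    const b = { address: "Example House, 1 Mare Street", postcode: "E8 2XY" };
    expect(matchBuildings([a], [b])).toEqual([[a, b]]);
  });

  it("does not match buildings that merely share a street name", () => {
    const a = { address: "1 Mare St", postcode: "E8 1FT" };
    const b = { address: "99 Mare St", postcode: "E8 4RU" };
    expect(matchBuildings([a], [b])).toEqual([]);
  });
});
```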