Analyzing data using serverless technologies

Darius Bogdan

15/6/2021

Serverless has become a hot topic in software development. You have probably come across the term a few times, whether at conferences or in conversations with other people in the industry.

What will we learn today?

When to use serverless functions?
How to create a data processing pipeline?
How to use Google Cloud technologies in order to process data?

We chose Google as our cloud provider, but everything presented in this article can also be achieved with other cloud providers such as Amazon Web Services or Microsoft Azure.

What are we going to build?

In this article, we will see how to use serverless functions to build a data processing pipeline that filters and aggregates data.

Let’s imagine that we work at a tech company and every couple of weeks we receive files containing information about issues (tasks) from a variety of projects. From time to time, our managers open our application to see statistics across all projects.

Every month, the project managers check the status of each project: the total number of issues completed since the project started and the number of story points completed so far. Sometimes they also want to see all the issues that are not bugs and were already finished when the file was received.

In order to fulfill their needs, we are going to build a pipeline that filters and aggregates the data they are interested in.
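
To make these requirements concrete, here is a minimal in-memory sketch of the two operations that the pipeline will later run in BigQuery. The field names (project, type, status, storyPoints) and their values are assumptions made for illustration; the real files may use different columns.

```javascript
// Hypothetical issue records; field names and values are assumptions,
// not the real file layout.
const issues = [
  { project: 'alpha', type: 'story', status: 'done', storyPoints: 3 },
  { project: 'alpha', type: 'bug', status: 'done', storyPoints: 1 },
  { project: 'alpha', type: 'task', status: 'in progress', storyPoints: 5 },
  { project: 'beta', type: 'task', status: 'done', storyPoints: 8 },
];

// Filter step: keep issues that are finished and are not bugs.
function filterDoneNonBugs(rows) {
  return rows.filter((row) => row.status === 'done' && row.type !== 'bug');
}

// Aggregate step: per project, count finished issues and sum their story points.
function aggregateDone(rows) {
  const stats = {};
  for (const row of rows) {
    if (row.status !== 'done') continue;
    const entry = stats[row.project] || { issuesDone: 0, storyPointsDone: 0 };
    entry.issuesDone += 1;
    entry.storyPointsDone += row.storyPoints;
    stats[row.project] = entry;
  }
  return stats;
}

console.log(filterDoneNonBugs(issues));
console.log(aggregateDone(issues));
```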

Why are serverless technologies good in this case?

Single event that starts our processing pipeline.
Server not running 24/7.
Small functions with a single purpose.
Paying only while running.

The pipeline:

Upload the file into the application.
Upload the data into a data warehouse.
Filter the data we uploaded and put that into another table.
Aggregate the data and update the statistics.

We will implement this pipeline with Google technologies, but the same concept applies to any cloud provider that offers serverless technologies.

Technology stack:

Google Cloud Functions — serverless functions used to process the data.
BigQuery — data warehouse.
Node.js 8 — the runtime for our JavaScript functions.

We will first introduce these technologies and then see how to build the pipeline with them.

Serverless functions

Serverless functions are isolated functions that have only one purpose. With this in mind, we can think of each function as a black box with an input and an output.

Serverless functions shouldn’t replace a REST API. They should complement the main API, each with a single, isolated, dedicated purpose.

A good example is file upload: when a file is uploaded into our system, we want to apply some filtering and run some calculations on it.

There are multiple types of events that can trigger a serverless function. The types that we are going to use today are:

Google Cloud Storage — this type of event fires when we upload a file to a Cloud Storage bucket that we specify.
Pub/Sub Triggers — this will allow us to communicate between our functions.

To communicate between our functions, we can take two approaches using Google services:

Cloud Tasks — this is a service that allows us to manage distributed tasks. We can use this to trigger the next function of our pipeline after a function finishes. All we need to do is to create a Cloud Task which will call our HTTP Serverless function.
Pub/Sub — this is a real-time messaging service that allows us to trigger a function that listens to a topic. In the example we are building today, we are going to use this technology.
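
As a minimal sketch of the Pub/Sub approach: a Pub/Sub-triggered background function receives the message body as a base64-encoded string in message.data, so the sender and the receiver need matching encode and decode steps. The helper names and the payload shape ({ tableName }) are assumptions for illustration. The pubsub argument stands for a client created with new PubSub() from the @google-cloud/pubsub package.

```javascript
// Serialize a payload into the Buffer that Pub/Sub expects as message data.
function encodePayload(payload) {
  return Buffer.from(JSON.stringify(payload));
}

// Decode the base64 string that a Pub/Sub-triggered background function
// receives in `message.data`.
function decodePayload(message) {
  const json = Buffer.from(message.data, 'base64').toString('utf8');
  return JSON.parse(json);
}

// Publish a payload on a topic. The client is passed in as an argument so the
// helper can be exercised with a stub instead of a real Pub/Sub connection.
async function publishJson(pubsub, topicName, payload) {
  return pubsub.topic(topicName).publishMessage({ data: encodePayload(payload) });
}
```

Passing the client in, rather than creating it inside the helper, keeps the function free of hidden dependencies and easy to test locally.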

BigQuery

BigQuery is a serverless data warehouse. In BigQuery the data is organized in datasets and tables.

We are going to use this service for storing our data and analyzing it.

In our system we are going to upload CSV files that have the following structure:

Pipeline implementation

Upload the data to BigQuery

Our first serverless function will get the data from the file uploaded to Cloud Storage and upload it to BigQuery.

Trigger Type — Google Cloud Storage Finalize

This trigger fires once the file has been uploaded successfully to our storage. A function of this type receives two parameters:

data — the event payload.
context — the event metadata.
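
As a small illustration of these two parameters, the sketch below pulls out the fields a storage-triggered function typically needs. The function name describeUpload and the sample values are made up for the example; bucket, name, eventId and eventType are fields of the Cloud Storage event and its context.

```javascript
// Background function signature for a Cloud Storage "finalize" event.
// `data` describes the uploaded object; `context` describes the event itself.
function describeUpload(data, context) {
  return {
    bucket: data.bucket,          // bucket that received the file
    fileName: data.name,          // object path inside the bucket
    eventId: context.eventId,     // unique ID of this event
    eventType: context.eventType, // e.g. google.storage.object.finalize
  };
}

// Example invocation with a hand-written event; the values are illustrative.
const summary = describeUpload(
  { bucket: 'projects_files', name: 'issues-2021-06.csv' },
  { eventId: '1234567890', eventType: 'google.storage.object.finalize' }
);
console.log(summary);
```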

A nice benefit of using multiple services from the same cloud provider is that we do not need to set up credentials for each service by hand. Once the function is deployed, the client libraries authenticate automatically with the function's service account.

In this function, we get a reference to the file that was just uploaded and load its contents into a BigQuery table.

The function ‘bigQuerySafeName’ derives a table name from the file name that satisfies the following conditions:

Contains up to 1,024 characters.
Contains only letters (upper or lower case), numbers, and underscores.
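
One way to satisfy both conditions might look like the sketch below. It is an illustrative implementation, not necessarily the one used in the original project: it drops the file extension, replaces every character outside the allowed set with an underscore, and truncates the result.

```javascript
// Illustrative implementation; the exact rules of the original helper are assumed.
function bigQuerySafeName(fileName) {
  const withoutExtension = fileName.replace(/\.[^./]+$/, ''); // drop e.g. ".csv"
  return withoutExtension
    .replace(/[^A-Za-z0-9_]/g, '_') // anything outside the allowed set becomes "_"
    .slice(0, 1024);                // enforce the length limit
}

console.log(bigQuerySafeName('reports/issues 2021-06.csv'));
// prints "reports_issues_2021_06"
```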

After loading the data into the table, we publish a message on the filter-uploaded-data topic to trigger the second function in our pipeline.
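
Put together, the first function could be sketched roughly as follows. This is a sketch under stated assumptions, not the original code: the dataset name projects_data, the load options, and the payload shape are hypothetical, and the clients are injected through a factory so the function can be tried locally with stubs. The calls used on the clients (table.load and topic.publishMessage) come from the official Node.js client libraries.

```javascript
// In a deployed index.js the clients would come from the official packages:
//   const { BigQuery } = require('@google-cloud/bigquery');
//   const { Storage } = require('@google-cloud/storage');
//   const { PubSub } = require('@google-cloud/pubsub');
//   exports.uploadToBigQuery = makeUploadToBigQuery({
//     bigquery: new BigQuery(), storage: new Storage(), pubsub: new PubSub(),
//   });
const DATASET_ID = 'projects_data'; // assumed dataset name
const NEXT_TOPIC = 'filter-uploaded-data';

// Simplified stand-in for the bigQuerySafeName helper described above.
const bigQuerySafeName = (fileName) =>
  fileName.replace(/\.[^./]+$/, '').replace(/[^A-Za-z0-9_]/g, '_').slice(0, 1024);

function makeUploadToBigQuery({ bigquery, storage, pubsub }) {
  return async function uploadToBigQuery(data, context) {
    // Reference to the file that triggered the event.
    const file = storage.bucket(data.bucket).file(data.name);
    const tableName = bigQuerySafeName(data.name);

    // load() starts a load job and resolves once the job has finished.
    await bigquery.dataset(DATASET_ID).table(tableName).load(file, {
      sourceFormat: 'CSV',
      skipLeadingRows: 1, // assumes the first row is a header
      autodetect: true,   // let BigQuery infer the schema
    });

    // Tell the next function which table to read.
    const payload = Buffer.from(JSON.stringify({ tableName }));
    await pubsub.topic(NEXT_TOPIC).publishMessage({ data: payload });
    return tableName;
  };
}
```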

Deploying the function

We have just written our first function; now all we need to do is deploy it to the cloud. Google provides a CLI, gcloud, that can be used for this.

With this command, we say that we want to deploy a function named upload-to-bigquery in the region us-central1.

The name of the function (entry point) in our application is uploadToBigQuery. We want this function to trigger when a new file finishes uploading to the bucket ‘projects_files’. We give our function the maximum memory allowed by Google (2GB at the time of writing) and we set the maximum amount of time our function is allowed to run to 9 minutes.
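
As a sketch, the options described above map onto gcloud functions deploy flags roughly as follows. The snippet only assembles and prints the command, so each flag can carry a comment; it does not run the deployment. The nodejs8 runtime value is an assumption based on the stack listed earlier, and newer projects would need a currently supported runtime.

```javascript
// Assemble the deploy command piece by piece so each flag can be annotated.
const deployArgs = [
  'functions', 'deploy', 'upload-to-bigquery',      // name of the deployed function
  '--region=us-central1',
  '--entry-point=uploadToBigQuery',                 // exported function in our code
  '--runtime=nodejs8',                              // assumption: matches the stack above
  '--trigger-resource=projects_files',              // bucket to watch
  '--trigger-event=google.storage.object.finalize', // fire when an upload completes
  '--memory=2048MB',                                // 2GB
  '--timeout=540s',                                 // 9 minutes
];

const command = ['gcloud', ...deployArgs].join(' ');
console.log(command);
```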

Filter the data

We receive .csv files that contain a lot of data. Our analytics team is interested in all the issues that are done and are not bugs.

This brings us to the second function of our pipeline, which filters the data and saves it into another table in our data warehouse. The function listens to a Pub/Sub topic, and when a message is published, the function runs automatically.

BigQuery allows us to run queries and save the results into a table. Here we build a query, run it as an asynchronous job, and wait for the results. Once the results are returned, we publish a message on the update-final-table topic to trigger the last step of our pipeline.
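
A rough sketch of this second function is shown below. The dataset and destination table names, the column names (status, type) and their values, and the choice to overwrite the destination table on each run are all assumptions for illustration. The clients are injected so the function can be tried with stubs; createQueryJob and getQueryResults are calls from the @google-cloud/bigquery client.

```javascript
const DATASET_ID = 'projects_data';           // assumed dataset name
const FILTERED_TABLE = 'done_non_bug_issues'; // assumed destination table
const NEXT_TOPIC = 'update-final-table';

// Build the filter query. The column names and values are assumptions.
// The table name is validated before it is interpolated into the SQL text.
function buildFilterQuery(tableName) {
  if (!/^[A-Za-z0-9_]+$/.test(tableName)) {
    throw new Error(`Unexpected table name: ${tableName}`);
  }
  return `SELECT * FROM \`${DATASET_ID}.${tableName}\`
          WHERE status = 'Done' AND type != 'Bug'`;
}

function makeFilterUploadedData({ bigquery, pubsub }) {
  return async function filterUploadedData(message, context) {
    // Pub/Sub delivers the message body as a base64-encoded string.
    const { tableName } = JSON.parse(Buffer.from(message.data, 'base64').toString());

    // Start an asynchronous query job that writes its result into a table.
    const [job] = await bigquery.createQueryJob({
      query: buildFilterQuery(tableName),
      destination: bigquery.dataset(DATASET_ID).table(FILTERED_TABLE),
      writeDisposition: 'WRITE_TRUNCATE', // assumption: replace the previous batch
    });
    await job.getQueryResults(); // resolves once the job has finished

    // Trigger the last step of the pipeline.
    await pubsub.topic(NEXT_TOPIC).publishMessage({
      data: Buffer.from(JSON.stringify({ tableName })),
    });
  };
}
```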

Update the final data

The last step in our pipeline is to update the final table, where we keep the number of issues done and the number of story points done since the project began.

This function runs a job that updates the data in the final BigQuery table.
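
One possible shape for this last function is sketched below, using a MERGE statement that adds the totals of the newly uploaded batch to the running totals per project. The table names, the column names (project, status, story_points, issues_done, story_points_done) and the choice of MERGE are assumptions for illustration; the original query may differ.

```javascript
const DATASET_ID = 'projects_data';       // assumed dataset name
const STATS_TABLE = 'project_statistics'; // assumed final table

// Build a MERGE statement that adds the new batch's totals to the running
// totals per project. All column names are assumptions.
function buildUpdateQuery(sourceTable) {
  if (!/^[A-Za-z0-9_]+$/.test(sourceTable)) {
    throw new Error(`Unexpected table name: ${sourceTable}`);
  }
  return `
    MERGE \`${DATASET_ID}.${STATS_TABLE}\` AS stats
    USING (
      SELECT project, COUNT(*) AS issues_done, SUM(story_points) AS story_points_done
      FROM \`${DATASET_ID}.${sourceTable}\`
      WHERE status = 'Done'
      GROUP BY project
    ) AS batch
    ON stats.project = batch.project
    WHEN MATCHED THEN UPDATE SET
      issues_done = stats.issues_done + batch.issues_done,
      story_points_done = stats.story_points_done + batch.story_points_done
    WHEN NOT MATCHED THEN
      INSERT (project, issues_done, story_points_done)
      VALUES (batch.project, batch.issues_done, batch.story_points_done)`;
}

function makeUpdateFinalTable({ bigquery }) {
  return async function updateFinalTable(message, context) {
    const { tableName } = JSON.parse(Buffer.from(message.data, 'base64').toString());
    const [job] = await bigquery.createQueryJob({ query: buildUpdateQuery(tableName) });
    await job.getQueryResults(); // wait for the update job to finish
  };
}
```

Because Pub/Sub delivers messages at least once, a redelivered message would be counted twice with this additive approach, so a production version would need a guard such as recording which batches have already been merged.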

Conclusion

Serverless technologies fit many use cases and bring benefits ranging from cost reduction to small, readable chunks of code. As our implementation shows, serverless functions should be small, isolated functions with a single purpose. Chained together, they can also form complex data processing pipelines.

Darius Bogdan

Darius Bogdan has over 7 years of experience in cloud computing. He is a certified Google Cloud Architect and has led multiple development teams as Tech Lead.

As Head of Backend at Linnify, he manages the department, from day-to-day engineering activities to execution against delivery commitments.