Novigo Solutions - Consulting | IT Services

Mar 22 2022 | by Mahammed Nasir

Research on the best way to archive large data

A few months back I happened to work on a SaaS-based IoT project, a large one with more than 35 microservices running on a K8s cluster. The application is a hybrid, multi-cloud application built on technologies such as .NET Core, Python, PostgreSQL, MongoDB, gRPC and many other supporting open-source applications. Its components are hosted in a multi-cloud environment.

One of the major components of the application was IoT data collection, storage and report generation, spanning multiple microservices. The system received IoT signals from a large number of devices at different frequencies. Depending on the client's needs, the reporting interval could be configured for each device from 5 seconds to 60 seconds, that is, from 12 records down to 1 record per minute. Every day the system accumulated a massive amount of data in MongoDB. To cut costs, some of the clients/tenants (in SaaS) did not want to retain this IoT data for a very long duration. Hence, we decided to archive the IoT data to cold storage.

I started going through the options available on the market. The following options initially came to mind:

HDFS
Bring up another Mongo Cluster for archival.
Store it on AWS S3
Store it on Azure Storage

I ruled out the first two options for some obvious reasons. Although AWS S3 could also have been a solution, I started researching Azure Storage, since it is my primary area. Table storage was the best-suited option, and it is the one we chose.

Record size (bytes)	Record size (KB)
440	0.44

Data frequency (seconds)	Records per minute	Records per day	Data per day (KB)	Monthly data per device (MB)	Monthly data, 1,000 devices (GB)	Monthly data, 2,000 devices (GB)	Monthly data, 5,000 devices (GB)	Monthly data, 10,000 devices (GB)	Monthly data, 100,000 devices (GB)
60	1	1440	633.6	19.008	19.008	38.016	95.04	190.08	1900.8
30	2	2880	1267.2	38.016	38.016	76.032	190.08	380.16	3801.6
20	3	4320	1900.8	57.024	57.024	114.048	285.12	570.24	5702.4
15	4	5760	2534.4	76.032	76.032	152.064	380.16	760.32	7603.2
10	6	8640	3801.6	114.048	114.048	228.096	570.24	1140.48	11404.8
5	12	17280	7603.2	228.096	228.096	456.192	1140.48	2280.96	22809.6

The Azure Storage pricing appeared remarkably simple, so I started sizing the accumulating data to estimate the monthly cost. The table above shows that about 22.8 TB (22,809.6 GB) is the maximum expected monthly data size for 100K devices, reached when every device reports every 5 seconds.
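
The sizing arithmetic can be reproduced with a short script. This is only a sketch: the 440-byte record size comes from the table, while the 30-day month and the decimal units (1 GB = 10^9 bytes) are assumptions inferred from the table's figures, and the function name is made up for illustration.

```python
BYTES_PER_RECORD = 440  # record size taken from the sizing table
DAYS_PER_MONTH = 30     # assumption: a 30-day month, as the table's figures imply


def monthly_usage_gb(interval_seconds: int, devices: int) -> float:
    """Estimated monthly data volume in GB (decimal units, 1 GB = 10**9 bytes)."""
    records_per_day = (60 // interval_seconds) * 60 * 24
    bytes_per_month = records_per_day * BYTES_PER_RECORD * DAYS_PER_MONTH * devices
    return bytes_per_month / 10**9


# Print the 100,000-device column for every reporting interval in the table.
for interval in (60, 30, 20, 15, 10, 5):
    print(interval, round(monthly_usage_gb(interval, 100_000), 1))
```

For a 5-second interval and 100,000 devices this gives 22,809.6 GB, matching the last cell of the table.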

The next part was estimating the number of read and write operations per month. Since every device had its own data-retention period, the solution was to fetch all expiring records, archive them and delete them from the primary database. No read operations were expected in the near term. Assuming one write per record per device (/device/record), the maximum estimated number of write operations per month was 51,840,000,000 for 100K devices (17,280 records per day × 30 days × 100,000 devices).
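
The fetch, archive and delete cycle described above can be sketched as follows. This is a minimal illustration, not the project's actual code: plain Python lists stand in for MongoDB and the archive store, and the function name and record fields (`device_id`, `timestamp`) are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def archive_expired(primary: list, archive: list, retention_days: dict, now: datetime) -> int:
    """Move records older than their device's retention period to the archive.

    Returns the number of archive writes, one per record in this naive version.
    """
    writes = 0
    kept = []
    for record in primary:
        cutoff = now - timedelta(days=retention_days[record["device_id"]])
        if record["timestamp"] < cutoff:
            archive.append(record)  # one archive write per expired record
            writes += 1
        else:
            kept.append(record)
    primary[:] = kept  # delete the archived records from the primary store
    return writes


now = datetime(2022, 3, 22, tzinfo=timezone.utc)
primary = [
    {"device_id": "dev-1", "timestamp": now - timedelta(days=40)},
    {"device_id": "dev-1", "timestamp": now - timedelta(days=5)},
    {"device_id": "dev-2", "timestamp": now - timedelta(days=10)},
]
archive = []
writes = archive_expired(primary, archive, {"dev-1": 30, "dev-2": 7}, now)
print(writes, len(primary), len(archive))
```

Because each expired record costs one write here, the write count grows in step with the number of records collected.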

The Azure estimate showed very high pricing for the above sizes. The primary reason for the spike in price was the very high number of writes.

To reduce the number of writes, we decided to pass 100 records per API call. That worked! We drastically reduced the number of API calls and saw a large reduction in pricing.
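
The batching idea can be sketched as below. The batch size of 100 is the one from the text; it also happens to be the maximum number of entities Azure Table storage accepts in a single batch transaction (all sharing one partition key), although the article does not say whether that mechanism was used. The helper names are hypothetical and no Azure SDK call is made.

```python
from itertools import islice

BATCH_SIZE = 100  # records per API call, as described in the text


def batches(records, size=BATCH_SIZE):
    """Yield successive lists of at most `size` records."""
    it = iter(records)
    while chunk := list(islice(it, size)):
        yield chunk


def api_calls_needed(record_count: int, size: int = BATCH_SIZE) -> int:
    """Number of API calls needed when each call carries up to `size` records."""
    return (record_count + size - 1) // size


records = [{"id": i} for i in range(250)]
print([len(b) for b in batches(records)])  # [100, 100, 50]
print(api_calls_needed(17_280))            # one device-day at a 5-second interval
```

With batches of 100, the 17,280 records one device produces per day at a 5-second interval need 173 calls instead of 17,280.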

Data flow block diagram:

[Figure: data flow block diagram]

Conclusion:

There are a lot of options available on the market for data archival. In earlier days, data archival was tedious work, as it involved setting up infrastructure manually and maintaining it. Cloud storage removes the pain of maintaining the archival infrastructure and also provides advantages such as redundancy, through which the data can be replicated to different regions.
