Research on the best way to archive large data
A few months back I worked on a SaaS-based IoT project, a large one with more than 35 microservices running on a Kubernetes (K8s) cluster. The application is a hybrid, multi-cloud application built with technologies such as .NET Core, Python, PostgreSQL, MongoDB, gRPC and many supporting open-source applications, with its components hosted across multiple clouds.
One of the major components of the application was IoT data collection, storage and report generation, spanning multiple microservices. The system received IoT signals from a large number of devices at different frequencies. Depending on the client's needs, the reporting interval could be configured per device from 5 to 60 seconds, that is, from 12 records down to 1 record per minute. Every day the system accumulated a massive amount of data in MongoDB. To cut costs, some of the clients (tenants, in SaaS terms) did not want to retain this IoT data for very long. Hence, we decided to archive the IoT data to cold storage.
I started going through the options available on the market. These are the ones that initially came to mind:
- HDFS
- Bring up another Mongo Cluster for archival.
- Store it on AWS S3
- Store it on Azure Storage
I ruled out the first two options for fairly obvious reasons. Although AWS S3 could also have been a solution, I started by researching Azure Storage, since it is my primary area. Table storage turned out to be the best-suited option, and that is what we chose.
| Data size (bytes) per record | Data size (KB) per record |
|---|---|
| 440 | 0.44 |

Monthly data usage in GB for the device counts below:

| Data frequency (seconds) | Records per minute | Records per day | Data size per day (KB) | Monthly data usage for 1 device (MB) | 1,000 devices (GB) | 2,000 devices (GB) | 5,000 devices (GB) | 10,000 devices (GB) | 100,000 devices (GB) |
|---|---|---|---|---|---|---|---|---|---|
| 60 | 1 | 1440 | 633.6 | 19.008 | 19.008 | 38.016 | 95.04 | 190.08 | 1900.8 |
| 30 | 2 | 2880 | 1267.2 | 38.016 | 38.016 | 76.032 | 190.08 | 380.16 | 3801.6 |
| 20 | 3 | 4320 | 1900.8 | 57.024 | 57.024 | 114.048 | 285.12 | 570.24 | 5702.4 |
| 15 | 4 | 5760 | 2534.4 | 76.032 | 76.032 | 152.064 | 380.16 | 760.32 | 7603.2 |
| 10 | 6 | 8640 | 3801.6 | 114.048 | 114.048 | 228.096 | 570.24 | 1140.48 | 11404.8 |
| 5 | 12 | 17280 | 7603.2 | 228.096 | 228.096 | 456.192 | 1140.48 | 2280.96 | 22809.6 |
Azure Storage pricing appeared remarkably simple, so I started sizing the accumulating data to estimate the monthly cost. The table above made it clear that about 22.8 TB (22,809.6 GB) is the maximum expected monthly data size for 100K devices.
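The sizing figures can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes a 30-day month and decimal units (1 GB = 10^9 bytes), which is what the table's numbers appear to follow, and the function name `monthly_usage_gb` is made up for this example.

```python
RECORD_SIZE_BYTES = 440   # per-record size taken from the sizing table
DAYS_PER_MONTH = 30       # assumption: the table's figures match a 30-day month


def monthly_usage_gb(interval_seconds: int, devices: int) -> float:
    """Estimate monthly data volume in GB (decimal units, 1 GB = 10**9 bytes)."""
    records_per_day = (60 // interval_seconds) * 60 * 24
    bytes_per_month = records_per_day * RECORD_SIZE_BYTES * DAYS_PER_MONTH * devices
    return bytes_per_month / 10**9


if __name__ == "__main__":
    device_counts = (1_000, 2_000, 5_000, 10_000, 100_000)
    for interval in (60, 30, 20, 15, 10, 5):
        row = [round(monthly_usage_gb(interval, n), 3) for n in device_counts]
        print(interval, row)
```

Changing `RECORD_SIZE_BYTES` or the interval is a quick way to see how sensitive the monthly volume is to the payload size and the reporting frequency.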
The next part was estimating the number of read and write operations per month. Since every device had its own data retention period, the solution was to fetch all expiring records, archive them and delete them from the primary database. No read operations were expected in the near term. With one write per record per device, the maximum estimated number of write operations per month was 51,840,000,000 for 100K devices (17,280 records per day × 30 days × 100,000 devices).
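The fetch, archive and delete flow described above can be sketched roughly as follows. This is a simplified illustration, not the project's actual code: plain Python lists stand in for MongoDB and the cold storage, and the field names `device_id` and `timestamp` as well as both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def split_expired(records, retention_days, now):
    """Partition records into (expired, retained) using each device's retention period.

    records: iterable of dicts with hypothetical keys "device_id" and "timestamp".
    retention_days: dict mapping device_id -> number of days to keep data.
    """
    expired, retained = [], []
    for rec in records:
        cutoff = now - timedelta(days=retention_days[rec["device_id"]])
        (expired if rec["timestamp"] < cutoff else retained).append(rec)
    return expired, retained


def archive_expired(primary, archive, retention_days, now):
    """Copy expired records to the archive, then remove them from the primary store."""
    expired, retained = split_expired(primary, retention_days, now)
    archive.extend(expired)   # write to cold storage first
    primary[:] = retained     # delete from the primary store only afterwards
    return len(expired)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    primary = [
        {"device_id": "dev-1", "timestamp": now - timedelta(days=10)},
        {"device_id": "dev-1", "timestamp": now - timedelta(days=1)},
    ]
    archive = []
    print(archive_expired(primary, archive, {"dev-1": 7}, now), "record(s) archived")
```

Writing to the archive before deleting from the primary store means a failure in between leaves a duplicate rather than a lost record, which is usually the safer failure mode for archival jobs.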
The Azure estimate showed very high pricing for these sizes. The primary reason for the spike in price was the very high number of writes.
To reduce the number of writes, we decided to pass 100 records per API call. That worked! We cut the number of API calls drastically and saw a large reduction in price.
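Batching 100 records per call could look something like the sketch below. It is an illustration under assumptions: records are grouped by a hypothetical `device_id` field, on the assumption that the device ID is used as the partition key, because Azure Table storage batch transactions accept at most 100 entities and all of them must share one partition key. The helper names are invented for this example.

```python
from collections import defaultdict
from math import ceil

BATCH_SIZE = 100  # records per API call, as described above


def batches_by_device(records, batch_size=BATCH_SIZE):
    """Yield (device_id, batch) pairs; each batch holds at most batch_size records of one device."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["device_id"]].append(rec)  # "device_id" is a hypothetical field name
    for device_id, recs in grouped.items():
        for start in range(0, len(recs), batch_size):
            yield device_id, recs[start:start + batch_size]


def api_calls(record_count, batch_size=BATCH_SIZE):
    """Number of write calls needed for record_count records of a single device."""
    return ceil(record_count / batch_size)


if __name__ == "__main__":
    # One device reporting every 5 seconds produces 17,280 records per day.
    print("unbatched calls per day:", api_calls(17_280, batch_size=1))
    print("batched calls per day:  ", api_calls(17_280))
```

Each yielded batch would then go out as a single write request, for example through `TableClient.submit_transaction` if the `azure-data-tables` Python package is in use. For a device with many records this brings the call count down by close to a factor of 100.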
Data flow block diagram:
Conclusion:
There are a lot of options on the market for data archival. In earlier days data archival was tedious work, as it involved setting up infrastructure manually and maintaining it. Cloud storage removes the pain of maintaining the archival infrastructure, and it also provides advantages such as redundancy, through which the data can be replicated to different regions.