From YouTube: ESPA Module 4A - Engineering Upkeep
Description
Raymond Zhang (VP of Engineering at PiKNiK) discusses engineering cluster upkeep and scripting at the Enterprise Storage Provider Accelerator (ESPA) bootcamp week that took place in February 2022.
Data is growing at an incredible speed and much of this data is archived and/or simply lost by enterprises. Our program will accelerate net-new Enterprise Storage Providers into the ecosystem, using a Web3 protocol with an impressive incentive model called Filecoin.
Learn more at:
Sign-up: https://web3espa.io
Landing Page: https://m.fil.org/espa-bootcamp
Follow ESPA:
Twitter: https://twitter.com/web3ESPA
LinkedIn: https://www.linkedin.com/company/web3espa/
Managing a small Filecoin operation is easy. Well, it's easier than enterprise-level storage. At the enterprise level, all the processes and configurations that you had as a small operation won't transfer as you scale up past a certain threshold. There is a miner in the community that we know of that didn't configure their setup properly to provide actual storage, and the result was that once they grew past a certain point, the whole operation broke. It took 10 days to fix, and they lost all the block rewards they had earned through the penalties for failing the data audit.
So as you get bigger, there are more audits and more chances of getting penalized, because if you fail them, you're just going to get slashed. WindowPoSt is the data audit that you're subjected to every 24 hours. Failing this audit means you'll be financially penalized and lose the opportunity to win block rewards, which means earning more money. So it's really important to make sure that your systems are healthy and that you're able to perform this audit within the time window you're allotted. Some things to monitor are resource utilization, temperature, and wallet balances.
The audit process is very resource intensive: it utilizes RAM, CPU, GPU, disk space, and I/O. If your resources are being overloaded, your system will crash. The high workload also results in higher temperatures, so you've got to make sure your system is running within a reasonable temperature range. Fans can fail, which will cause your temperatures to spike, and this can damage your hardware, so it's important to monitor that. And finally, watch wallet balances, because the audit requires a message to be submitted to the blockchain, to the network.
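As a minimal sketch of that kind of monitoring, the following Python script polls CPU, RAM, disk, and temperature readings with the psutil library and checks a wallet balance by shelling out to the lotus CLI (assuming lotus is on the PATH; the wallet address and alert thresholds here are hypothetical):

```python
import subprocess
import psutil  # pip install psutil

WALLET = "f1..."            # placeholder wallet address
MIN_BALANCE_FIL = 5.0       # hypothetical alert threshold
MAX_TEMP_C = 80.0           # hypothetical alert threshold

def check_resources():
    """Report utilization; WindowPoSt needs headroom on all of these."""
    print(f"CPU: {psutil.cpu_percent(interval=1):.0f}%")
    print(f"RAM: {psutil.virtual_memory().percent:.0f}%")
    print(f"Disk: {psutil.disk_usage('/').percent:.0f}%")

def check_temperatures():
    """Flag any sensor running hot (e.g. after a fan failure)."""
    for name, readings in psutil.sensors_temperatures().items():
        for r in readings:
            if r.current and r.current > MAX_TEMP_C:
                print(f"ALERT: {name}/{r.label or '?'} at {r.current:.0f}C")

def check_wallet():
    """WindowPoSt messages fail if the wallet can't pay gas."""
    out = subprocess.run(["lotus", "wallet", "balance", WALLET],
                         capture_output=True, text=True, check=True)
    balance = float(out.stdout.split()[0])  # output looks like "12.345 FIL"
    if balance < MIN_BALANCE_FIL:
        print(f"ALERT: wallet balance low: {balance} FIL")

if __name__ == "__main__":
    check_resources()
    check_temperatures()
    check_wallet()
```

In practice you'd run something like this from cron or a metrics agent and wire the alerts into your paging system rather than printing them.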
Some things to optimize: data access, because the audit process is reading data from all your storage media. It has to read the data, so the faster you're able to read the data, the higher your chances that you can finish the audit within the 30-minute window you're allotted.

As an enterprise-level storage provider, you have hundreds, if not thousands, of machines that you need to manage and make sure they're all working on tasks in coordination. With all these different machines, you have compute resources that you want to dynamically allocate to certain tasks; you don't want to waste resources by having them sit and do nothing. The way you can do this is through automation and orchestration tools that will tell machines that aren't working on something to work on something else, as in the sketch below.
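To make the idea concrete, here is a minimal, self-contained sketch of that kind of orchestration loop: a dispatcher keeps a backlog of tasks and hands one to any worker that reports itself idle. Real deployments would use an orchestration tool (or the scheduler built into the mining software) rather than this toy loop; every name here is illustrative:

```python
from collections import deque

# Backlog of work (in practice: sealing jobs, data transfers, proofs, ...)
backlog = deque(f"seal-sector-{i}" for i in range(10))

# Worker fleet and what each machine is currently doing (None = idle)
workers = {"worker-01": None, "worker-02": None, "worker-03": None}

def report_done(worker: str) -> None:
    """Called when a machine finishes its task and becomes idle."""
    print(f"{worker} finished {workers[worker]}")
    workers[worker] = None

def dispatch() -> None:
    """Assign backlog tasks to idle machines so nothing sits unused."""
    for worker, task in workers.items():
        if task is None and backlog:
            workers[worker] = backlog.popleft()
            print(f"{worker} -> {workers[worker]}")

if __name__ == "__main__":
    dispatch()                 # all three workers pick up a task
    report_done("worker-02")   # one frees up...
    dispatch()                 # ...and is immediately reassigned
```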
A
You
also
need
to
perform
maintenance
on
all
these
systems.
So
your
storage
system,
your
miner,
your
worker
nodes,
the
most
important
thing
to
call
out
here
is
the
software
updates
and
the
security
updates.
If
you're
unable
to
update
your
system
in
a
timely
manner,
your
systems
are
vulnerable
to
attack
during
that
time
period.
So
how
do
you
push
security
updates
to
thousands
of
machines?
There's
no
way
to
do
that,
except
with
automation,.
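As one hedged illustration, this Python sketch pushes a package upgrade to a fleet over SSH in parallel. It assumes key-based SSH access and apt-based hosts, and the hostnames are placeholders; at real scale you'd likely reach for a configuration-management tool such as Ansible instead:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder inventory; a real fleet would load thousands of hosts from a file
HOSTS = [f"worker-{i:04d}.example.internal" for i in range(1, 6)]
UPDATE_CMD = "sudo apt-get update && sudo apt-get -y upgrade"

def update_host(host: str) -> tuple[str, bool]:
    """Run the update command on one host over SSH."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, UPDATE_CMD],
        capture_output=True, text=True)
    return host, result.returncode == 0

if __name__ == "__main__":
    # Patch many machines concurrently instead of one at a time
    with ThreadPoolExecutor(max_workers=50) as pool:
        for host, ok in pool.map(update_host, HOSTS):
            print(f"{host}: {'updated' if ok else 'FAILED'}")
```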
If your chain node falls out of sync, you won't be able to tell the network that you're going to store a piece of data, so your whole operation comes to a standstill until you get back in sync with the blockchain network. In order to keep 100% uptime, you need to have multiple nodes syncing to the blockchain, so that in case one node fails, you can automatically fail over to the next node that is in sync with the blockchain and never have downtime.
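A minimal failover check might compare chain heights across your nodes over the Lotus JSON-RPC API and route traffic to whichever node is furthest along. This sketch assumes each node exposes the standard /rpc/v0 endpoint; depending on how the node's API permissions are configured you may also need to attach a JWT auth token, and the endpoints here are placeholders:

```python
import json
import urllib.request

# Placeholder endpoints for the redundant chain nodes
NODES = ["http://10.0.0.1:1234/rpc/v0", "http://10.0.0.2:1234/rpc/v0"]

def chain_height(endpoint: str) -> int:
    """Ask one Lotus node for its current chain head height."""
    payload = json.dumps({"jsonrpc": "2.0", "id": 1,
                          "method": "Filecoin.ChainHead", "params": []})
    req = urllib.request.Request(endpoint, data=payload.encode(),
                                 headers={"Content-Type": "application/json"})
    # Add an Authorization header here if your node requires a token
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)["result"]["Height"]

def pick_node() -> str:
    """Fail over to whichever node is reachable and furthest in sync."""
    best, best_height = None, -1
    for node in NODES:
        try:
            height = chain_height(node)
        except Exception as err:        # unreachable or stuck node
            print(f"{node}: unreachable ({err})")
            continue
        if height > best_height:
            best, best_height = node, height
    return best

if __name__ == "__main__":
    print("active node:", pick_node())
```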
A
As
a
source
provider
you're
going
to
be
importing
tons
of
data
petabytes,
if
not
zetabytes,
and
the
only
real
way
to
do
this
to
manage
this
much
data
at
scale
is
to
have
some
type
of
automated
system
to
do
this.
The
file
coin
protocol
only
limits
64,
gibb
sectors
at
the
maximum,
but
most
commonly
32gb
is
used.
A
So
if
you
have
files
that
are
larger
than
this
size,
the
only
way
to
store
it
on
the
falcon
network
is
to
cut
them
up
into
smaller
pieces
so
that
you
can
put
them
into
the
sector
for
storing
data
on
the
network.
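A bare-bones chunker for that step might look like the following. Note that the usable payload of a sector is actually slightly less than the raw sector size because of padding, so the piece size here is an illustrative constant, not the exact protocol limit:

```python
import os

PIECE_SIZE = 32 * 1024**3   # illustrative; usable payload is a bit less
BUF_SIZE = 16 * 1024**2     # stream in 16 MiB buffers to bound memory use

def split_file(path: str) -> list[str]:
    """Cut a large file into sector-sized pieces for onboarding."""
    pieces, index = [], 0
    with open(path, "rb") as src:
        while True:
            written = 0
            piece_path = f"{path}.piece{index:04d}"
            with open(piece_path, "wb") as dst:
                while written < PIECE_SIZE:
                    buf = src.read(min(BUF_SIZE, PIECE_SIZE - written))
                    if not buf:
                        break
                    dst.write(buf)
                    written += len(buf)
            if written == 0:            # ran out of input: drop empty piece
                os.remove(piece_path)
                break
            pieces.append(piece_path)
            index += 1
    return pieces

if __name__ == "__main__":
    for p in split_file("big-dataset.tar"):  # placeholder input file
        print(p, os.path.getsize(p))
```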
To store data on the network, you also need to put up collateral. This incentivizes you to be a good actor on the network, so you don't just shut down once you have the money, once you've been paid by the customer and have their data. Managing collateral is important. As you store data, your collateral is returned to you over a vesting schedule, so it's slowly returned to you. The same goes for the block rewards that you earn: you don't get all the rewards immediately; those are also on a vesting schedule.
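As a toy model of such a schedule, the function below computes how much of a reward is spendable a given number of days later, assuming a portion unlocks immediately and the remainder vests linearly. The 25% / 180-day split mirrors what Filecoin adopted for block rewards in FIP-0004, but treat the numbers here as illustrative:

```python
IMMEDIATE_FRACTION = 0.25   # portion unlocked right away (per FIP-0004)
VESTING_DAYS = 180          # remainder vests linearly over this period

def available(reward_fil: float, days_elapsed: int) -> float:
    """FIL from one reward that is spendable `days_elapsed` days later."""
    immediate = reward_fil * IMMEDIATE_FRACTION
    vesting = reward_fil - immediate
    vested = vesting * min(days_elapsed, VESTING_DAYS) / VESTING_DAYS
    return immediate + vested

if __name__ == "__main__":
    for day in (0, 45, 90, 180):
        print(f"day {day:3d}: {available(10.0, day):.2f} of 10.00 FIL")
```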
So you need to be able to maintain a natural flow where FIL is coming in through block rewards and returned collateral, and then you're utilizing that again to store more data. For data storage, you need to have your sealing cluster. These are the systems that encode the data being stored so that it can be audited. So, based on how fast you can seal, you need to feed your systems data at a rate they can handle.
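That feed rate is simple arithmetic: sealing throughput times sector size gives the most data you can usefully ingest per day. A quick back-of-the-envelope helper, with hypothetical inputs:

```python
SECTOR_GIB = 32             # common sector size on mainnet

def max_ingest_tib_per_day(sectors_sealed_per_day: float) -> float:
    """Ingesting faster than you can seal just builds a backlog."""
    return sectors_sealed_per_day * SECTOR_GIB / 1024

if __name__ == "__main__":
    # e.g. a cluster sealing 48 sectors/day can absorb ~1.5 TiB/day
    print(f"{max_ingest_tib_per_day(48):.2f} TiB/day")
```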
Data preservation is the core function of a storage provider. Data degrades over time; that's inevitable. Hard drives will also fail, and data gets corrupted when bits get flipped, for whatever reason; that just happens. So it's really important that you have data redundancy to continue passing WindowPoSts even when the inevitable happens. And let's say your whole data center breaks down, or you lose power: it's important to have a full data backup somewhere for disaster recovery, so that you're able to restore the data and continue doing these audits.
A
However,
full
data
backup
is
really
expensive
at
the
enterprise
scale.
If
you
have
petabytes
or
zettabytes
of
data
you're
not
going
to
have
another
set
of
bite
or
petabyte
of
data
backed
up
somewhere
else,
that's
double
the
amount
of
that's
double
the
cost.
So
you
so.
In
order
to
tackle
this,
you
need
to
ensure
data
durability
through
the
compute
of
parity
bits.
So
these
are
bits
that
you
can
store
throughout
different
locations
and
you're
able
to
rebuild
the
original
data
from
these
bits.
But
you
also
need
to
be
proactive
about
this.
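The simplest version of the idea is XOR parity, sketched below: store one parity block alongside the data blocks, and any single lost block can be rebuilt from the survivors. Production systems generalize this with Reed-Solomon erasure coding to tolerate multiple losses; this is just an illustration of the principle:

```python
def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together byte by byte."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

if __name__ == "__main__":
    data = [b"AAAA", b"BBBB", b"CCCC"]     # three data blocks
    parity = xor_blocks(data)              # stored at another location

    # Disk holding block 1 dies; rebuild it from the rest plus parity
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]
    print("rebuilt:", rebuilt)
```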
Data availability and preservation is vital. Like I mentioned, if you don't have the data for the audit, if you fail the audit, you're going to be penalized until you restore it, or until you go bankrupt. And finally, utilize resources efficiently by automating based on your needs. Your systems can only handle so much workload, and you also don't have unlimited FIL to post as collateral, so you need to figure out the optimal setup where your systems run smoothly and continuously.