#27 Gensyn: Decentralised Compute for Machine Learning. The solutions. [Part 2]
Table of contents
How to design such a system?
In the previous article, I described a protocol for distributed Machine Learning algorithms called Gensyn.
I outlined some difficulties in setting up such a system, and in today's article I am going to discuss how to come up with (some) solutions.
Let's get started!
How to design such a system?
I am going to focus on the following problems that need to be addressed:
Work verification: how do you know the weights being provided are not just random numbers?
How to set up the right incentives for people to participate?
How to deal with privacy issues? Most companies can't just share data freely.
In , it has been shown that metadata from gradient-based optimisation processes can be used to construct certificates of work performed, which can be verified quickly by replicating certain stages of the training. The idea rests on the fact that entropy growth during training gives the verifier of a training step an advantage over anyone trying to forge it.
The paper uses some cool mathematical techniques; if you are into verification proofs, machine learning, and entropy-based intuitions, it is the paper for you.
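To make the certificate idea concrete, here is a minimal sketch of the verification loop, not Gensyn's actual implementation: the solver logs weight checkpoints plus the batches used, and the verifier replays one segment deterministically and checks the replayed weights land close to the claimed checkpoint. All names, the toy least-squares model, and the tolerance are my own illustrative assumptions.

```python
import numpy as np

def sgd_step(w, X, y, lr=0.1):
    """One deterministic SGD step on a least-squares loss."""
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def train_with_checkpoints(w0, batches, every=5):
    """Solver: train and log a weight checkpoint every `every` steps."""
    w, checkpoints = w0.copy(), [(0, w0.copy())]
    for t, (X, y) in enumerate(batches, start=1):
        w = sgd_step(w, X, y)
        if t % every == 0:
            checkpoints.append((t, w.copy()))
    return checkpoints

def verify_segment(checkpoints, batches, i, tol=1e-8):
    """Verifier: replay the steps between checkpoints i and i+1 and
    accept only if the result is within `tol` of the claimed weights."""
    (t0, w), (t1, w_claimed) = checkpoints[i], checkpoints[i + 1]
    for X, y in batches[t0:t1]:
        w = sgd_step(w, X, y)
    return float(np.linalg.norm(w - w_claimed)) <= tol
```

Verifying one segment costs a fraction of the full training run, which is the whole point: checking is cheaper than doing.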
Setting the right incentives for the different actors
There are four main participants in the network:
Submitters are the end-users of the system, providing tasks that will be computed and paying for units of work completed.
Solvers are the main workers of the system, performing the model training and generating proofs to be checked by Verifiers.
Verifiers are key to linking the non-deterministic training process to a deterministic linear computation, replicating portions of the Solvers’ proofs and comparing distances with expected thresholds.
Whistleblowers are the final line of defence, checking Verifiers’ work and challenging in the hope of receiving a jackpot payout.
When submitting a task, an estimate of required work is generated by constructing and unrolling a computational graph into the required operations.
The transaction fee paid by the Submitter can then be based on this estimate, with any excess (e.g. due to pessimistic profiling) returned to the Submitter after computation. Crucially, unrolling the graph requires hard limits on any logic that could otherwise run into the halting problem. Tasks form the smallest quantity of ML work that can be pushed to the protocol. Using parallelisation, larger computational workloads can be split into sets of tasks and pushed to the network asynchronously.
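A toy version of this fee estimation might look as follows; the op names, per-op costs, price, and the unrolling cap are all illustrative assumptions, not the protocol's real profiler. The hard cap is what sidesteps the halting problem: unrolling always terminates or the task is rejected.

```python
# Hypothetical per-op FLOP costs for a tiny op vocabulary (assumption).
FLOP_COST = {"matmul": 2_000_000, "relu": 1_000, "add": 1_000}

# Hard cap so unrolling always terminates (avoids the halting problem).
MAX_OPS = 10_000

def estimate_task_fee(graph, repeat, price_per_flop=1e-9):
    """Unroll `graph` (a flat list of op names) `repeat` times,
    sum the FLOP costs, and convert to a fee."""
    total_ops = len(graph) * repeat
    if total_ops > MAX_OPS:
        raise ValueError("graph exceeds the protocol's unrolling limit")
    flops = sum(FLOP_COST[op] for op in graph) * repeat
    return flops * price_per_flop
```

If actual profiling during execution comes in under this pessimistic estimate, the difference is what gets refunded to the Submitter.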
Verifiers will periodically grab profiling tasks and generate variation thresholds for proof-of-learning comparisons. To generate a threshold, a Verifier will deterministically run and re-run portions of the training with different random seeds, generating and checking their own proofs. In doing this, the Verifier will build up an aggregate expected distance threshold that can later be used as a threshold to validate the non-deterministic work of the Solvers.
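The threshold calibration above can be sketched like this, where a seeded noisy step stands in for real hardware non-determinism; the function names, the noise model, and the safety margin are assumptions for illustration only.

```python
import numpy as np

def noisy_step(w, seed, noise=1e-3):
    """One 'training step' whose result varies slightly with the seed,
    mimicking non-deterministic floating-point execution."""
    rng = np.random.default_rng(seed)
    return w * 0.9 + rng.normal(scale=noise, size=w.shape)

def calibrate_threshold(w, seeds, margin=2.0):
    """Verifier: re-run the step under several seeds and take the largest
    pairwise distance (times a safety margin) as the accepted threshold."""
    results = [noisy_step(w, s) for s in seeds]
    dists = [np.linalg.norm(a - b)
             for i, a in enumerate(results) for b in results[i + 1:]]
    return margin * max(dists)

def accept(w, claimed, seeds):
    """Accept a Solver's claimed result if it sits within the calibrated
    distance threshold of the Verifier's own reference run."""
    thr = calibrate_threshold(w, seeds)
    ref = noisy_step(w, seeds[0])
    return float(np.linalg.norm(ref - claimed)) <= thr
```

Honest non-determinism stays inside the calibrated band; fabricated weights land far outside it.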
In order to ensure the honesty of the Verifiers when generating the distance thresholds, Whistleblowers are expected to re-run the profiling work and challenge Verifiers where appropriate.
Following verification of the proof-of-learning, Whistleblowers can replicate Verifier work in order to check that the verification work itself has been performed correctly. In the event that a Whistleblower believes that verification has been performed incorrectly (maliciously or not) they can challenge the Verifier to contract arbitration in order to receive a reward. This reward can come from Solver and Verifier deposits in the case of a true positive or from the jackpot treasury in the case of a false positive.
In practice, this means that Whistleblowers are expected to join and leave the network depending on the number of other active (i.e. with live deposits and challenging) Whistleblowers.
Therefore, the expected default strategy for any Whistleblower is to join the network when there are a low number of other Whistleblowers, post a deposit, randomly choose an active task, and begin their verification process. Following the conclusion of the first task, they would grab another random active task and repeat until the number of Whistleblowers increases above their determined payout threshold, whereupon they would leave the network (or more likely, switch to performing another role in the network--Verifier or Solver--depending on their hardware capabilities) until the situation reverses again.
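The default strategy described above boils down to a simple decision rule; the payout threshold value and the role names in this sketch are my own illustrative assumptions.

```python
def whistleblower_action(active_whistleblowers, payout_threshold=5):
    """Default strategy: join (post a deposit, audit random active tasks)
    while competition is low; switch to another role once the number of
    active Whistleblowers pushes the expected payout below the threshold."""
    if active_whistleblowers < payout_threshold:
        return "join_and_challenge"
    return "switch_role"
```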
For privacy preservation, models can be constructed using secure mapping layers as proposed by . The idea behind secure mapping layers is that input features are projected into a new space as a function of the hash of the previous block that performed the computation. In addition, the publicly accessible training data is encrypted.
In this way, models can be trained on ciphertext with a small accuracy penalty: usually less than 0.5%.
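A minimal sketch of the idea, assuming a block-hash-seeded random projection as the mapping (the real construction in the cited paper differs; function names and dimensions here are hypothetical): every node can derive the same projection from public chain state, yet the raw features never appear in the clear.

```python
import hashlib
import numpy as np

def mapping_matrix(prev_block_hash: bytes, in_dim: int, out_dim: int):
    """Derive a deterministic projection matrix from the previous block's
    hash, so every node maps inputs identically without sharing a key."""
    seed = int.from_bytes(hashlib.sha256(prev_block_hash).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.normal(size=(in_dim, out_dim)) / np.sqrt(in_dim)

def secure_map(X, prev_block_hash: bytes, out_dim: int):
    """Project raw features into the obfuscated space used for training."""
    M = mapping_matrix(prev_block_hash, X.shape[1], out_dim)
    return X @ M
```

Training then proceeds on the projected representation, which is where the small (sub-0.5%) accuracy penalty comes from.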
While I still think that tackling distributed training this way is a very bold proposal, reading solutions grounded in very recent literature was illuminating and challenged what I believed to be completely impossible.
I hope you enjoyed this not-so-usual article. Let me know if you enjoy this different style in the comments or on LinkedIn :).