I am thinking about how to do fault tolerant distributes computing.
I think the concept that would make this feasible would be verified functional coroutines. Where there are no side effects that are globally witnessed.
I think this would need to have command nodes that schedule what coroutine is executed on which machine in the cluster. I would want the command nodes to be redundant, so that if one of them failed the computation would still complete. The command node would need to know what each node could process, what processing they specialize in, ex. GPUs specialize in SIMD classes of problems, so if we had a piece that was calling coroutines in that way, we could run them on a GPU focused machine rather than a normal general purpose machine.
The command node would need real time statistics to understand the cost of communication to the nodes, so it could intelligently schedule what would be queued on each machine, and when to shift a coroutine to a machine that is less able to perform the task, but would finish sooner than one that has a longer wait time.
Each coroutine would need to have it’s estimated run time and estimated communication size. Then we could make intelligent decisions as to when to move the computation to a different machine, and when to just run the computation on the machine that needs the result, and when to run the computation on a machine that is faster and transmit the result back.
Every routine has the time that it should take, and an timeout limit. If the result hasn’t returned before the timeout has been reached the command node sends the routine to a different node and marks the node as unresponsive. Once it receives a response from the node, it asks for it’s current load and network traffic to properly rank the nodes.
What we would need in this system is a language that was composed of verified coroutines, that way we knew exactly what each one did in the time between when it started and when it yielded. It would also have to have the size of communication as a function of the input size, and the estimated complexity to know approximately how long it would take to finish the stage of the coroutine.
They would have to be functionally programmed, because then we wouldn’t worry about any side effects. We would still have to think through deadlock scenarios, but I think that would be possible to analyze the code as a compilation step.
No comments:
Post a Comment