Monday, July 26, 2010

Cloud Computing Models -- What is the future?

I was reading through the first chapter of “The Cloud at Your Service” and came to the section describing the different types of clouds available.

The three types that they discussed were:

  1. IaaS -- Infrastructure as a Service. An example is Amazon EC2. In this model you are given access to the machines themselves (as virtualized hardware). You provision the operating systems and install the applications you need on each type of machine. That way you define the virtual image that you want to run for your app servers, your database servers, etc.
  2. PaaS -- Platform as a Service. This is the model of Google App Engine. You don’t provision operating systems; instead, you are limited to the languages and services the platform supports.
  3. SaaS -- Software as a Service. In this model you subscribe to software that is run for you. The software may or may not give you the ability to extend it.

I’m trying to decide which model will make the most sense in the future. I think that if we want to utilize systems most efficiently we may need to look harder at the PaaS model. That way developers must understand that the systems they are building will run on other machines. But we would also need a way to run the platform locally, so developers can test applications at a small scale before pushing them up to the cloud.

I think that if a platform were created that encouraged developers to think about problems in a more cloud-friendly way, applications would scale more easily. However, the platform would need to handle the most common causes of performance issues. The major issue that comes up when scaling systems is accessing the database. This could be minimized if the platform allowed the code that needs the data to move to it, increasing the locality in time and place between the data and the computation that needs it.
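To make that idea concrete, here is a toy sketch in Python of what “moving code to the data” could look like. The DataNode class and its run method are entirely hypothetical; the point is just that the function travels to the node holding the data, and only the small result travels back.

```python
# A toy sketch of "moving code to the data": instead of pulling rows across
# the network, ship the function to the node that already holds them.
# Everything here (DataNode, run) is hypothetical, just to show the shape.

class DataNode:
    def __init__(self, name, rows):
        self.name = name
        self.rows = rows  # data that lives on this node

    def run(self, fn):
        # Execute fn next to the data; only the (small) result travels back.
        return fn(self.rows)

node = DataNode("db-1", [{"user": "a", "spend": 10}, {"user": "b", "spend": 25}])

# Ship the aggregation to the data instead of fetching every row.
total = node.run(lambda rows: sum(r["spend"] for r in rows))
print(total)  # 35
```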

These are just some of my ideas that I needed to get out.

Saturday, July 17, 2010

Distributed Computing

I am thinking about how to do fault-tolerant distributed computing.

I think the concept that would make this feasible is verified functional coroutines, where no side effects are globally visible.

I think this would need command nodes that schedule which coroutine is executed on which machine in the cluster. I would want the command nodes to be redundant, so that if one of them failed the computation would still complete. A command node would need to know what each node can process and what kind of processing it specializes in; for example, GPUs specialize in SIMD classes of problems, so if part of the computation called its coroutines in that style, we could run it on a GPU-focused machine rather than a normal general-purpose machine.
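As a rough illustration, a capability-aware dispatcher might look something like this; the node records, capability tags, and workload classes are all made up:

```python
# A sketch of capability-aware dispatch: the command node matches a
# coroutine's workload class to a specialist, falling back to general-purpose.

NODES = [
    {"name": "gpu-1", "capabilities": {"simd"}, "queue_len": 2},
    {"name": "cpu-1", "capabilities": {"general"}, "queue_len": 0},
]

def pick_node(workload_class):
    # Prefer a specialist for this workload; otherwise any general node.
    specialists = [n for n in NODES if workload_class in n["capabilities"]]
    candidates = specialists or [n for n in NODES if "general" in n["capabilities"]]
    # Among eligible nodes, take the one with the shortest queue.
    return min(candidates, key=lambda n: n["queue_len"])

print(pick_node("simd")["name"])     # gpu-1
print(pick_node("general")["name"])  # cpu-1
```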

The command node would need real-time statistics to understand the cost of communicating with each node, so it could intelligently schedule what gets queued on each machine, and decide when to shift a coroutine to a machine that is less suited to the task but would finish sooner than one with a longer wait time.
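That ranking could reduce to a simple estimated-finish-time heuristic. Here is a sketch; the cost model and every number in it are invented, but it shows how a slower, idle machine can beat a faster, busy one:

```python
# Rank nodes by estimated finish time: queue wait + stretched run time +
# time to move the input over the wire. All values are hypothetical.

def estimated_finish(node, task):
    run_time = task["base_run_time"] / node["speed_factor"]
    comm_time = task["input_bytes"] / node["bandwidth_bytes_per_s"]
    return node["queue_wait_s"] + run_time + comm_time

nodes = [
    {"name": "fast-but-busy", "speed_factor": 4.0, "queue_wait_s": 30.0,
     "bandwidth_bytes_per_s": 1e8},
    {"name": "slow-but-idle", "speed_factor": 1.0, "queue_wait_s": 0.0,
     "bandwidth_bytes_per_s": 1e8},
]
task = {"base_run_time": 20.0, "input_bytes": 5e7}

best = min(nodes, key=lambda n: estimated_finish(n, task))
print(best["name"])  # slow-but-idle: 20.5s total vs 35.5s
```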

Each coroutine would need to carry its estimated run time and estimated communication size. Then we could make intelligent decisions about when to move the computation to a different machine, when to just run it on the machine that needs the result, and when to run it on a faster machine and transmit the result back.
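The move-or-stay decision is basically arithmetic once those estimates exist. A back-of-the-envelope version, with hypothetical values:

```python
# Offload only if remote run time plus the cost of shipping the input there
# and the result back still beats running locally. All numbers are made up.

def should_offload(local_run_s, remote_run_s, input_bytes, result_bytes,
                   bandwidth_bytes_per_s):
    transfer_s = (input_bytes + result_bytes) / bandwidth_bytes_per_s
    return remote_run_s + transfer_s < local_run_s

# A 10s job that a faster node finishes in 2s is worth moving here,
# because the transfer only adds about a second.
print(should_offload(10.0, 2.0, 1e8, 1e6, 1e8))  # True: 2 + ~1.01 < 10
```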

Every routine has an expected run time and a timeout limit. If the result hasn’t come back before the timeout is reached, the command node sends the routine to a different node and marks the original node as unresponsive. Once it eventually receives a response from that node, it asks for its current load and network traffic so the nodes can be ranked properly again.
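A minimal sketch of that timeout-and-failover loop, using asyncio timeouts as a stand-in for whatever RPC layer would actually exist; the node records and run_on are placeholders:

```python
import asyncio

unresponsive = set()

async def run_on(node, task):
    # Stand-in for an RPC call out to the node; slowdown fakes a sick machine.
    await asyncio.sleep(task["expected_s"] * node["slowdown"])
    return f"{task['name']} done on {node['name']}"

async def dispatch(task, nodes):
    for node in nodes:
        if node["name"] in unresponsive:
            continue
        try:
            # Give each node its expected time plus slack before failing over.
            return await asyncio.wait_for(run_on(node, task),
                                          timeout=task["expected_s"] * 2)
        except asyncio.TimeoutError:
            unresponsive.add(node["name"])  # re-ranked when it answers later
    raise RuntimeError("no responsive nodes")

nodes = [{"name": "flaky", "slowdown": 5.0}, {"name": "healthy", "slowdown": 1.0}]
print(asyncio.run(dispatch({"name": "job-1", "expected_s": 0.1}, nodes)))
```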

What we would need in this system is a language composed of verified coroutines, so that we would know exactly what each one does between the time it starts and the time it yields. Each coroutine would also have to declare its communication size as a function of input size, along with its estimated complexity, so we would know approximately how long each stage of the coroutine would take to finish.
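I don’t have such a language, but as a sketch of what the annotations might look like, here is one way to attach them in Python. The decorator and both cost functions are invented, and a real system would verify these claims rather than trust the programmer:

```python
# Attach a declared cost model to each coroutine: bytes moved and estimated
# step count, each as a function of the input size n.

def costed(comm_bytes, complexity):
    def wrap(fn):
        fn.comm_bytes = comm_bytes    # bytes transferred, as f(n)
        fn.complexity = complexity    # estimated steps, as f(n)
        return fn
    return wrap

@costed(comm_bytes=lambda n: 8 * n, complexity=lambda n: n * n)
def pairwise_distances(points):
    # Pure: the result depends only on the input, no global side effects.
    return [abs(a - b) for a in points for b in points]

n = 1000
print(pairwise_distances.comm_bytes(n))  # 8000 bytes expected to move
print(pairwise_distances.complexity(n))  # ~1,000,000 steps expected
```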

The coroutines would have to be written functionally, because then we wouldn’t have to worry about side effects. We would still have to think through deadlock scenarios, but I think it would be possible to analyze the code for those as a compilation step.
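That compile-time check could be a cycle search on a waits-for graph between coroutines. A tiny sketch, with a hand-written graph standing in for what a compiler would extract from the code:

```python
# Reject the program if the waits-for graph has a cycle (a deadlock risk).
# Depth-first search with a "currently visiting" set to catch back edges.

def has_cycle(graph):
    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back edge: a potential deadlock cycle
        visiting.add(node)
        if any(visit(n) for n in graph.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(n) for n in graph)

print(has_cycle({"a": ["b"], "b": ["c"], "c": []}))  # False: no deadlock
print(has_cycle({"a": ["b"], "b": ["a"]}))           # True: a and b wait on each other
```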