Author Topic: Fault Tolerant: LinuxPPC & A1Lite (Read 1917 times)

asian1 · « **on:** August 15, 2003, 02:33:03 PM »

Hello
I have an idea about creating a fault-tolerant server
using 3 mini-ITX A1Lite boards, realtime Linux PowerPC
(hardened kernel) and a dual redundant
power supply unit (2 x 500 watt).
The first board is the main board, the second
a backup and the third is the supervisor board.
If the supervisor fails, the backup becomes the supervisor.
It would be possible to remove a board (turned off
first) without shutting down the whole system.
Is this idea possible?

Real Time

fnord · « **Reply #1 on:** August 15, 2003, 03:07:41 PM »

Hm, actually I'd suppose that if you're seriously going to try it out and experiment with it, you'll be better off with the Pegasos. You know, the A1 Lite isn't there yet, so by using a Pegasos you'll be able to get your feet wet and try it. And you have to admit that you (at least) shouldn't have any drawbacks from using a Pegasos. The only advantage of an A1 would be the possibility of using AmigaOS4, but you're not going to build a redundant server with AmigaOS anyway, are you? And why would you need 2x500W? Are you really looking into doing this, or is it more like just a nice idea?

Floid · « **Reply #2 on:** August 15, 2003, 03:39:49 PM »

Quote

asian1 wrote:
Hello
I have an idea about creating a fault-tolerant server
using 3 mini-ITX A1Lite boards, realtime Linux PowerPC
(hardened kernel) and a dual redundant
power supply unit (2 x 500 watt).
The first board is the main board, the second
a backup and the third is the supervisor board.
If the supervisor fails, the backup becomes the supervisor.
It would be possible to remove a board (turned off
first) without shutting down the whole system.
Is this idea possible?

Real Time

Hmm. A high-availability (fault-tolerant) cluster should certainly be possible with the hardware. However, you're barking up the wrong tree: "realtime" refers to the timing guarantees of the kernel and its scheduler, à la AmigaOS or QNX (will a syscall or whatever be serviced within N µs/ms on X hardware?), *not* the fault-tolerance of the system as a whole. Some of the realtime projects might have high-availability solutions, but it's a separate problem domain.
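To make the distinction concrete, here's a rough sketch of what a "realtime" question looks like: how late does a timed wakeup arrive, and did any wakeup blow past a bound? The function names and the 100 µs bound are made up for illustration, and an ordinary Python process on a stock kernel guarantees nothing here. It only measures.

```python
import time

def wakeup_latencies_us(samples=5, period_s=0.001):
    """Sleep for period_s repeatedly; record how late each wakeup was (in µs)."""
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        time.sleep(period_s)
        elapsed = time.perf_counter() - start
        # Lateness beyond the requested period, clamped at zero.
        latencies.append(max(0.0, (elapsed - period_s) * 1e6))
    return latencies

def missed_deadlines(latencies_us, bound_us):
    """Return the latencies that exceeded the bound (hypothetical deadline)."""
    return [lat for lat in latencies_us if lat > bound_us]

if __name__ == "__main__":
    lats = wakeup_latencies_us()
    late = missed_deadlines(lats, bound_us=100.0)  # 100 µs is an arbitrary bound
    print(f"{len(late)} of {len(lats)} wakeups exceeded the bound")
```

A realtime kernel is about bounding that worst case. Nothing in it tells you what happens when the board itself dies, which is the fault-tolerance question.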

Aside from that, the topology you're talking about is a little suboptimal; with one supply (even a redundant one) and shared disks(?), it only really protects you from mainboard failures, which are actually somewhat rare... So moving to independent machines, with independent supplies, would probably be just as maintainable, if not more so. (Think about it; if half the redundant supply fails, and you've had to modify it to support 3 boards, that's going to be a bitch to swap out, right? While replacing one machine in its own rack or case is comparatively 'easy.')

That said, I've seen projects to hook up an ATX supply in this manner, but I can't find them. (You'll either want to give one machine control over the ATX power-on signal - oops, single point of failure again! - or wire it 'on' all the time - in either case, no ability to shut down just a single machine if someone calls you up and tells you it's on fire.)

Check out the Linux High Availability Project for more information, and some examples of their idea of reliable designs. There's an old HOWTO with some example configurations, but if you follow them to the letter, keep in mind that the 'Y' topology used on their SCSI chain is really more of a 'V' - and that any modern/popular SCSI card that runs the bus 'through' the card (internal and external connectors) will probably create a nice big stub on the chain electrically, putting you at *greater* risk of data corruption. If you need to share disks like that, FC-AL is probably a better/more robust solution until SAS is out?

Nothing's perfect. Best of luck with it!

Floid · « **Reply #3 on:** August 16, 2003, 01:08:49 AM »

Someone brought up the existence of Linux 'network block device' solutions on Kerneltrap, which might come in handy if you're serious about doing this and trying to solve the 'disk' aspect. See that node in the comments thread. Again, point is that a dedicated monitor (vs. a 'peer to peer' arrangement) has a single point of failure in the monitor, so the question is whether you want redundant hardware (fault-tolerance/high-availability) or 3x the capacity (3 machines doing useful work) with the risk of downtime.

Depends on the service and all, too... Chances are you need a load-balancer or DNS server to make the 'high-availability' switchover(s) work seamlessly, and how to make *that* not a single point of failure is anyone's guess. (Maybe all nodes can share one IP address, with firewall rules on the two 'spares' blocking traffic until the active machine's heartbeat or equivalent signal dies. Throwing that third machine in there gives you more contention to worry about, if you're really dead-set on the idea of a triple-redundant system.)
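The heartbeat idea could be sketched roughly like this. It's only the decision logic for one 'spare', with the clock passed in so it's easy to test; the class name, the timeout, and the `active` flag (standing in for "drop the blocking firewall rule") are all assumptions for illustration, not how any particular HA package does it.

```python
class StandbyNode:
    """One spare that takes over when the active node's heartbeat goes quiet."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s  # silence tolerated before takeover
        self.last_beat = now        # time of the last heartbeat seen
        self.active = False         # False = still blocking traffic

    def on_heartbeat(self, now):
        # Heartbeats only matter while we're still a spare;
        # there is no automatic fail-back in this sketch.
        if not self.active:
            self.last_beat = now

    def tick(self, now):
        """Call periodically; returns True once this node has taken over."""
        if not self.active and now - self.last_beat > self.timeout_s:
            self.active = True  # here a real node would unblock the shared IP
        return self.active

if __name__ == "__main__":
    spare = StandbyNode(timeout_s=3.0, now=0.0)
    spare.on_heartbeat(now=1.0)
    print(spare.tick(now=3.5))  # heartbeat still fresh enough
    print(spare.tick(now=4.5))  # more than 3 s of silence
```

With two spares watching the same heartbeat, both would flip at about the same moment, which is exactly the contention problem. Giving each a different timeout is one crude way to stagger them, but it doesn't make the problem go away.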

Now, if you're all about high-availability *computation,* you might want to look into the 'voting' arrangements that, say, the Space Shuttle's computers use. Here's a link with far too much detail on that.
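For a feel of what 'voting' means in general, here's a generic strict-majority voter. To be clear, this is a textbook-style sketch, not the Shuttle's actual scheme; the function name and the idea of comparing plain return values are assumptions for illustration.

```python
from collections import Counter

def vote(results):
    """Return the value a strict majority of replicas agree on, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # Require more than half of the replicas to agree.
    return value if count > len(results) / 2 else None

if __name__ == "__main__":
    print(vote([42, 42, 41]))  # two of three replicas agree
    print(vote([1, 2, 3]))     # no majority at all
```

Note that with three boards, one can be wrong (or dead) and the other two still outvote it, which is the usual argument for an odd number of replicas.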

And yeah, I'm the 'Anonymous' in that thread wondering if iSCSI would/could/should be the end-all be-all of the network block device solutions everyone in open-source land is presently homegrowing.
