High Availability Server Memory
When you buy a server, your primary goal should be to take whatever steps are needed to make sure that the server, along with its applications and data, is available to your users at all times. This isn't the desktop world, where it's OK that Windows crashes every day and you hit save every few minutes because you know it will. In the server world, a crash is not just inconvenient for 25+ users. It can also destroy data in a business system, or keep it from being saved, and that could have a dramatic effect on your business.
Servers have done a very good job with data redundancy at the hard drive level for many years now. Even my first NetWare 3 server included built-in RAID (redundant array of inexpensive disks). No one in their right mind would buy a server these days without some sort of RAID setup.
Within the last 5 years or so we've seen tremendous advances in server memory technology. The days of having to take a server down and open it up to replace or upgrade memory are drawing to a close. Memory architectures now include hot swappable and spare memory modules. This is some pretty cool stuff, so I thought I would tell you about it.
The first advance is something called an online spare. This is where you install an extra memory module that sits idle until something goes wrong with an active memory module. Then the system disables the bad active module and starts using the online spare. This lets you keep working and replace the bad module later, but there is the potential that data will be lost when the active module goes bad.
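To make the trade-off concrete, here is a toy Python sketch of an online spare. Everything in it (the `Module` and `OnlineSpareBank` classes, the address, the stored value) is made up for illustration, and it models the worst case, where the active module dies abruptly. It is not how any particular server implements the feature.

```python
class Module:
    """A hypothetical memory module: just a name and some stored cells."""

    def __init__(self, name):
        self.name = name
        self.cells = {}  # address -> value


class OnlineSpareBank:
    """Toy model: one active module plus one idle spare."""

    def __init__(self):
        self.active = Module("active")
        self.spare = Module("spare")  # installed, but unused until a failure

    def write(self, addr, value):
        # Only the active module ever receives data.
        self.active.cells[addr] = value

    def read(self, addr):
        return self.active.cells.get(addr)

    def fail_active(self):
        """Disable the bad module and start using the spare."""
        if self.spare is None:
            raise RuntimeError("no spare left; the server needs service")
        self.active, self.spare = self.spare, None


bank = OnlineSpareBank()
bank.write(0x10, "order #1")
bank.fail_active()
print(bank.active.name)  # spare
print(bank.read(0x10))   # None: the spare never held a copy
```

The server keeps running on the spare, but whatever lived only on the failed module is gone in this worst-case model.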
The next step up is hot swappable memory, which is actually quite a technical feat. It comes in two forms: hot swap mirrored memory and hot swap RAID memory. Hot swap means that you can remove or install memory without shutting the server down. Mirrored memory is when you install twice the amount of memory that you need and the server writes an exact copy of everything on the active module to the inactive module. This way, if the active module goes down, the system can fail over to the mirrored module without losing any data.
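Mirroring can be sketched the same way. This is a simplified Python model, and the `MirroredBank` class and its method names are my own invention rather than anything from a vendor. Every write goes to both copies, so a failure of the primary costs nothing.

```python
class MirroredBank:
    """Toy model of mirrored memory: every write lands on two modules."""

    def __init__(self):
        self.primary = {}  # address -> value on the active module
        self.mirror = {}   # exact copy kept on the second module
        self.primary_ok = True

    def write(self, addr, value):
        # Keep both copies in step while the primary is healthy.
        if self.primary_ok:
            self.primary[addr] = value
        self.mirror[addr] = value

    def read(self, addr):
        # Serve from the primary, or fail over to the mirror.
        source = self.primary if self.primary_ok else self.mirror
        return source[addr]

    def fail_primary(self):
        """Simulate the active module going down."""
        self.primary_ok = False
        self.primary.clear()


bank = MirroredBank()
bank.write(0x10, "order #1")
bank.fail_primary()
print(bank.read(0x10))  # order #1
```

The price is visible in the model too: two dictionaries hold one dictionary's worth of data, just as mirrored memory needs twice the modules.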
Mirrored memory has one weakness: if the two memory modules develop physical errors in exactly the same location, you would lose data. In a high volume production environment this could be a big deal, but my guess is that it happens extremely rarely. To guard against it, you can run RAID memory, where data and parity information are striped across memory modules. If a memory module goes down and is taken offline, the data and parity information on the other modules can be used to recreate its contents.
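Here is a minimal Python sketch of how parity lets you rebuild a lost module. I am assuming simple XOR parity, the same idea used in disk RAID. Real RAID memory hardware may compute parity differently, and the function names and sample data below are invented for the example.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Parity block stored on its own module (assumed XOR scheme)."""
    return xor_blocks(data_blocks)


def rebuild(surviving_blocks, parity):
    """Recreate the one missing block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


# Three hypothetical data modules, each holding four bytes.
modules = [b"ACCT", b"0042", b"PAID"]
parity = make_parity(modules)

# The second module fails and is taken offline.
rebuilt = rebuild([modules[0], modules[2]], parity)
print(rebuilt)  # b'0042'
```

This works because XOR-ing a value in twice cancels it out, so combining the survivors with the parity block leaves only the missing data. It also shows the limit of the scheme: with a single parity block you can recover from one failed module at a time, not two.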
This was (hopefully) all very interesting, but what does it mean to your business? It means that if you choose your server memory technology carefully, you can minimize downtime. Remember, downtime is a bad thing and it will cost you money. As an example, a long, long time ago I was a budding new network administrator, fresh out of college and still wet behind the ears. Our server was the heart of our operation: everything we did depended on it. One day it started to get flaky (these things usually start vaguely) and then it got worse and worse. I eventually diagnosed the problem as a bad memory module. I had to take the server down, open it up, and remove the memory module. After I bought a new module, I had to come in to work on a Saturday so I could take the server down and open it up again at a time when no one needed it. That whole process could have been skipped using any of the advanced memory technologies I described above.
Just like in real life, you don’t want your server’s memory to go on you.
If you want to read more, HP has a great white paper on high availability memory architectures, available from their server page.



