The Adventures of Systems Boy!

Confessions of a Mac SysAdmin...

Networked Home Accounts and The New RAID

We recently installed a brand-spankin' new RAID in our machine room for hosting network home accounts. We bought it as a replacement for our aging and horrendously unreliable Panasas RAID. The Panasas was a disaster for almost the entire three-year span of its lease. It ran a proprietary operating system based on some flavor of *NIX (which one, I can't recall at the moment), but with so many variations from a typical *NIX install that using it as a home account server was far more difficult than it ever should have been. To be fair, it was never really intended for such a use; it was designed as a file server cluster for Linux workstations, managed from a web browser rather than the command line. It was built for speed, not stability, and it was completely the wrong product for us. (And for the record, I had nothing to do with its purchase, in case you're wondering.)

What the Panasas was, however, was instructive. For three years we lived under the shadow of its constant crashing, the near-weekly tcpdump captures and help requests to the company, and angry users fed up with a system that occasionally caused them to lose data, and frequently caused their machines to lock up for the duration of a Panasas reboot, which could be up to twenty minutes. It was not fun, but I learned a lot from it, and it enabled me to make some very serious decisions.

My recent promotion to Senior Systems Administrator came just prior to the end of our Panasas lease term. This put me in the position of both purchasing a new home account server, and of deciding the fate of networked home accounts in the lab.

If I'd learned anything from the experience with the Panasas, it was this: a home account server must be, above all else, stable. Every computer that relies on centralized storage for home accounts is completely and utterly dependent on that server. If the server goes down, your lab, in essence, goes down. When that starts happening a lot, people begin to lose faith in a lot of things. First and foremost, they lose faith in the server and stop using it, making your big, expensive network RAID a big, expensive waste of money. Second, they lose faith in the system you've set up (sensibly, since it doesn't work reliably) and fall back on whatever contingency plan you've provided for the times the server is down. In our case, that was a local user account people could log into whenever the home account server was unavailable. Things got so bad for a while that people logged into this local account more often than their home accounts, negating all our efforts at centralizing home account data storage. Lastly, people begin to lose faith in your abilities as a systems administrator and lab manager. Your reputation suffers, and that makes it harder to get things done, even improvements. So, stability. Centralizing a key resource is risky: if that resource fails, everything else fails with it. Stable, crucial, centralized storage was key if any kind of network home account scenario was going to work.

The other thing I began to assess was the whole idea of networked home accounts themselves. I don't know how many labs use them. I suspect quite a few do, but there are probably also a lot that don't. I've read about plenty of places that prefer generic local accounts that revert to some default state at every login and logout. Though I personally really like the convenience of a customized network home account that follows you from computer to computer throughout a facility, the setup brings a fair amount of hassle and risk. When it works it's great, but when it doesn't, it's really bad. So I began to question the whole idea. Was this something we really needed or wanted to continue to provide?
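For readers unfamiliar with the mechanics, a networked home account is usually just a home directory mounted on demand from the central server when a user logs in, on whatever machine they sit down at. A minimal sketch using the Linux automounter over NFS follows; note that the protocol, server name, and paths are all my assumptions for illustration, since the post never specifies how the lab's accounts are served:

```shell
# /etc/auto.master -- hand the /home directory over to the automounter
# (NFS + autofs are assumed here; the post never names its protocol)
/home   /etc/auto.home

# /etc/auto.home -- a wildcard map: when user "jdoe" logs in on any
# machine, the automounter mounts raidserver:/export/home/jdoe at
# /home/jdoe, so the same personalized account follows them around.
*   -fstype=nfs,rw,nosuid   raidserver:/export/home/&
```

The appeal and the risk both fall out of this design: every login on every machine depends on that one server answering.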

My ultimate decision was intimately linked to the stability of the home account server. From everything I've seen, networked home accounts can and do work extremely well when the centralized storage on which they reside is stable and reliable. And there is value to this. I talked to people in the lab. By and large, from what I could glean from my very rudimentary and unscientific conversations with users, people really like having network home accounts when they work properly. When given the choice between a generic local account or their personalized network account, even after all the headaches, they still ultimately prefer the networked account. So it behooves us to really try to make it work and work well. And, again, everything I saw told me that what this really required, more than anything else, was a good, solid, robust and reliable home account server.

So, that's what we tried our best to get. The new unit is built and configured by a company called Western Scientific, which was recommended to me by a friend. It's called the Fusion SA: a 24-bay storage server running Fedora Core 5 Linux. We've populated 16 of the bays with 500GB drives configured at RAID level 5, giving us, when all is said and done, about 7TB of networked storage, with room to grow in the remaining bays should we ever want to do so. The unit also features a quad-port GigE PCI-X card whose ports we can trunk together for speedy network access. It's big and it's fast. But what's most important is its stability.
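The arithmetic behind that "about 7TB": RAID 5 spends one drive's worth of capacity on parity, and drive vendors count in decimal gigabytes while the operating system reports binary ones. A quick sketch of the math (Python used just for the arithmetic):

```python
# RAID 5 usable space is (n - 1) drives' worth: one drive's capacity
# is consumed by the distributed parity.
drives = 16
drive_bytes = 500 * 10**9            # a "500GB" drive in vendor decimal bytes
usable_bytes = (drives - 1) * drive_bytes
usable_tib = usable_bytes / 2**40    # binary terabytes, as the OS reports them
print(round(usable_tib, 2))          # ~6.82, i.e. "about 7TB"
```

So the sixteen 500GB drives yield roughly 6.8TB of usable space once parity and the decimal-to-binary conversion are accounted for.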

Our new RAID came a little later than we'd hoped, so we weren't able to test it before going live with it. Ideally, we would have gotten the unit mid-summer and tested it in the lab while maintaining our previous system as a fall-back. What happened instead was that we got the unit in about the second week of the semester, and outside circumstances eventually necessitated switching to the new RAID sans testing. It was a little scary. Here we were in the third week of school switching over to a brand new but largely untested home account server. It was at this point in time that I decided, if this thing didn't work — if it wasn't stable and reliable — networked home accounts would become a thing of the past.

So with a little bit of fancy footwork we made the ol' switcheroo, and it went so smoothly our users barely noticed anything had happened. Installing the unit was really a simple matter of getting it in the rack, and then configuring the network settings and the RAID. This was exceptionally quick and easy, thanks in large measure to the fact that Western Scientific configured the OS for us at the factory, and also to the fact that they tested the unit for defects prior to shipping it to us. In fact, our unit was late because they had discovered a flaw in the original unit they'd planned to ship. Perfect! I'm glad it was late. This is exactly what we want from a company that provides us with our crucial home account storage. If the server itself proved as reliable as the company was diligent, we most likely had a winner on our hands. So, how has it been?

It's been several weeks now, and the new home account server has been up, without fail or issue, the entire time. So far our new home account server has been extremely stable (so much so that I almost forget about it, until, of course, I walk past our server room and stop to dreamily look upon its bright blue drive activity lights dutifully flickering away without pause). And if it stays that way, user confidence should return to the lab and to the whole idea of networked home accounts in fairly short order. In fact, it seems like it already has to a great extent. I couldn't be happier. And the users?... Well, they don't even notice the difference. That's the cruel irony of this business: When things break, you never hear the end of it, but when things work properly, you don't hear a peep. You can almost gauge the success or failure of a system by how much you hear about it from users. It's the ultimate in "no news is good news." The quieter the better.

And 'round these parts of late, it's been pin-drop quiet.



11:27 AM

You speak the truth, brother. No news is good news... You want them to thank you when it's working, but they don't. If they can get their work done without the system interfering, then it's golden. Of course, when things break, we hear about it. We sit around waiting for things to fail. We are the evil Maytag repairmen from a parallel universe. Or something like that. Anyway, keep up the good blogging. I managed to sneak some geek talk onto my blog, in between all the baby pics. Some reverse SSH tunnelin' fun.

4:39 PM

Hey, nice blog.

Yeah, this can be a thankless job. But I do get pretty happy when the knocks on the door are fewer and farther apart. Maybe users don't notice that everything's running smoothly, but I can tell the difference. Warms the cockles of my heart.

Anyway, nice instructions in your post. I should try this some day. I've never done it. Just haven't ever had the need. And I'm very, very lazy.



10:46 PM

Thank you for the post. I just contacted Western Scientific today about a RAID system. Besides the RAID 5 for the network home directories, do you also do offline (i.e. tape) backups? Have you heard of iSCSI? How can I contact you directly?

Thanks and keep up the great work!    

2:28 AM


We don't have a tape backup. Our philosophy is that, since we're dealing with student work, the students are responsible for backing up their work. If we were storing multi-million-dollar TV projects for big clients, we'd certainly have a tape backup. But we'd have the budget for it, too. Still, someday we might have the money to get one. I have looked into them, but I can't recall offhand which ones I liked.

We looked into iSCSI too, but rejected it. We wanted something very straightforward and industry standard, and iSCSI looked a bit too experimental for our tastes. Again, it all comes down to reliability for us. We didn't quite understand iSCSI — how it worked or its limitations — and we weren't really willing to try another experiment like our Panasas RAID.

BTW, I hate to tempt fate, but our Western Scientific RAID has had 100% uptime since we went live with it. It's been at least a month-and-a-half, and I'm in hog heaven. It is a godsend, and I highly recommend it thus far.

I don't generally give out my email here on the site, but if you want to contact me through the comments that would be fine. I'm pretty good about responding. If you really want to email directly, leave me your email and I'll contact you.

Good luck with your RAID purchase!


1:40 PM

I know this is a very old post, but I wanted to chime in.
I usually like FC* Linux, but for the file server you're dealing with, it might be wise to use CentOS 4, or now 5. The problem with FC5 is the lack of bleeding-edge driver support for the hardware RAID cards in the Fusion SA. One run of up2date with a kernel update and you'll be hosed, with no mounted RAID.

just be careful    

12:01 PM


Thanks for the info. I'll bear that in mind. We'll probably not update that OS for a really long time, if ever. It seems to be working, and I'm from the "If it ain't broke, don't fix it" school. But if we ever do, we'll surely look at CentOS. I keep hearing about it. Maybe the next big Linux thing?
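For anyone in the same boat, the commenter's warning can also be guarded against without forgoing updates entirely, by telling the package manager to leave the kernel alone. A sketch, assuming yum is the updater in use on the box (the `exclude` directive must land in the `[main]` section of yum.conf, which on a stock install is the only section in that file):

```shell
# Keep yum from ever pulling in a new kernel, so a routine update can't
# strand an out-of-tree RAID driver built against the running kernel.
echo "exclude=kernel*" >> /etc/yum.conf

# Verify the line took:
grep "^exclude=" /etc/yum.conf
```

Kernel updates then have to be done deliberately, with the RAID driver rebuilt as part of the same maintenance window.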

