[afnog] Self induced Mail Server (POP) crash and recovery.

Ismail M. Settenda ismail at habari.co.tz
Thu Nov 19 10:01:20 UTC 2009


Hi guys,

Check this out,

I was doing maintenance on a FreeBSD 7.2 mail server and, to free up disk
space, I was trying to delete the folder /home/ismail/src (while on the
phone) and instead accidentally deleted the /usr directory. This literally
crippled the server, as almost all programs run from binaries or scripts
under /usr. The immediate concern at the time was to get the MTA running
again.

The immediate strategy was to recover the files by one of the following:

   1. Shutting down the server and using the second RAID disk, as it was
   running RAID 1.
   2. Using the installation CD to try to copy the files or re-install the
   broken programs.
   3. Copying files from a similar mail server's /usr directory.

Option 3 proved to be the most usable, as the RAID had already mirrored the
delete to the second disk. The second option was also futile, as the CD could
only copy or repair files by reinstalling. Unfortunately the copy process was
quite slow: using a portable flash drive, it took almost 6 hours to complete
a 6 GB copy. I suspect the reason was the USB 1 port on one of the mail
servers.
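
As a rough sanity check on that suspicion, the arithmetic below works out the
average rate of the copy and compares it with the 12 Mbit/s signalling rate
of a USB 1.1 full-speed port. It is only a sketch: it assumes the 6 GB figure
means decimal gigabytes, that the full 6 hours were spent copying, and that
the port really was USB 1.1 full speed. The constant and function names are
made up for illustration.

```python
# Back-of-the-envelope check of the copy rate described above.
# Assumptions (not measured): 6 GB = 6 * 10**9 bytes, 6 hours elapsed,
# and a USB 1.1 full-speed port with a 12 Mbit/s signalling rate.

COPIED_BYTES = 6 * 10**9
ELAPSED_SECONDS = 6 * 60 * 60
USB11_FULL_SPEED_BITS_PER_SECOND = 12 * 10**6


def average_rate(copied_bytes, elapsed_seconds):
    """Return the average transfer rate in bytes per second."""
    return copied_bytes / elapsed_seconds


def usb_ceiling(bits_per_second):
    """Return the raw bus rate in bytes per second (ignores protocol overhead)."""
    return bits_per_second / 8


observed = average_rate(COPIED_BYTES, ELAPSED_SECONDS)
ceiling = usb_ceiling(USB11_FULL_SPEED_BITS_PER_SECOND)

print(f"observed: {observed / 1000:.0f} kB/s")
print(f"USB 1.1 raw ceiling: {ceiling / 1000:.0f} kB/s")
print(f"observed is {observed / ceiling:.0%} of the raw ceiling")
```

The observed rate comes out well under the raw bus rate, which is consistent
with a slow port but does not prove it; a slow flash drive or a very large
number of small files would look much the same.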

After the copy the machine rebooted fine and came back online, but most
programs of course wouldn't run properly, i.e. the two servers had different
packages. Re-installing the packages became impossible and, although we had a
backup of some of the config files, at the time we thought we would have to
delete the package directory and reinstall everything anyway. So we decided
to re-install from the CD and just replace the entire /usr directory with a
new one. Unfortunately, when we did this it overwrote files in the /
partition as well, namely /etc/passwd and /etc/group, of which, surprise
surprise, it later turned out we didn't have a fresh copy in the backup.

So after the fresh OS install we did fresh installs of all the services that
were running before and copied over the backup conf files. The MTA was fixed,
but related programs like mailman and webmail had further issues such as
version mismatches, lock files, and library and package inconsistencies (it
seems portupgrade and make install in the package directory install totally
different dependency packages). It took a total of 72 hours to get the server
back up to fair running condition.

So, after all this, I am curious:

   1. Is it that easy to cripple a server, and what steps does one take to
   avoid it (let's ignore unauthorized entry for now)?
   2. Could the immediate recovery strategy and decisions have been handled
   better?
   3. Any ideas on how to avoid this in future and how to recover better?
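
To make question 1 concrete, the kind of safeguard I have in mind is a
wrapper that refuses to remove system directories and defaults to a dry run.
A minimal sketch in Python follows; the PROTECTED list and the names
is_protected and guarded_rmtree are hypothetical, not an existing tool, and
the list would need adjusting per server.

```python
import os
import shutil

# Hypothetical list of directories that should never be removed wholesale.
PROTECTED = ("/", "/usr", "/etc", "/bin", "/sbin", "/var", "/home", "/boot", "/lib")


def is_protected(path, protected=PROTECTED):
    """Return True if path resolves (through symlinks) to a protected directory."""
    resolved = os.path.realpath(path)
    return resolved in {os.path.realpath(p) for p in protected}


def guarded_rmtree(path, dry_run=True):
    """Delete a directory tree unless it is protected; dry run by default."""
    if is_protected(path):
        raise ValueError(f"refusing to delete protected directory: {path}")
    if dry_run:
        # Report the fully resolved target so a mistyped path is visible.
        return f"would delete {os.path.realpath(path)}"
    shutil.rmtree(path)
    return f"deleted {path}"


if __name__ == "__main__":
    print(guarded_rmtree("/home/ismail/src"))  # dry run, nothing is removed
    try:
        guarded_rmtree("/usr")
    except ValueError as err:
        print(err)
```

This only guards against removing a protected directory itself, not against
deleting everything underneath it one item at a time, so it would complement
a tested backup of / and /usr rather than replace it.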

--
Ismail