jann
Newbie
Offline
Posts: 12
|
 |
« Reply #20 on: July 01, 2010, 09:56:49 PM » |
|
This has to be a big one for them. I've been here since Ventures on line days too, and this is one of a kind. Let's be supportive; they just don't know. This can happen to anyone. Disk drives are a weak link, no doubt.
|
Adel
|
 |
« Reply #21 on: July 01, 2010, 11:08:35 PM » |
|
I'm a long-term (VO days) client too, and this is the first prolonged issue we have had. Our clients, unfortunately, don't seem to care about that today.
I wonder if it would have been possible to have at least email up and running more quickly. For many of our clients email is the major issue.
|
junaid@payandgo.biz
|
 |
« Reply #22 on: July 02, 2010, 12:56:25 AM » |
|
hi
What is going on now with the server activation? My customers were disrupted all day yesterday, and again today, and they are shouting at me a lot now. Please activate the services to relieve my frustration. Please post a status update.
|
ecs
Newbie
Offline
Posts: 2
|
 |
« Reply #23 on: July 02, 2010, 01:04:08 AM » |
|
Is it possible to know what is going on with mail that is unable to be delivered to defender? Is there some arrangement in place where the undelivered mail is being queued on a secondary MX server, ready to be delivered when defender comes back online? Or is it just waiting to time out as being undelivered for XX hours?
|
john
|
 |
« Reply #24 on: July 02, 2010, 01:14:44 AM » |
|
Hi,
First I want to thank everyone for their patience. This has taken much longer than we anticipated, and I thought I would provide some more details on the timeline and what activities were being done. Since most of you are long-term customers, you may know that my personal policy is to be very open with details where appropriate and to provide honest answers.
For about the first 5 hours of the outage the data center was attempting to get the old server running. This involved running an "fsck", a file system integrity check in which the server scans the disks and repairs errors. In this case the "fsck" was not able to repair the errors to the point where the server would boot. Later we learned there were issues with the hard drives that were the likely cause of the disk errors that triggered the kernel panics.
At that point we knew we had to restore the server onto new hardware. We actually had 2 complete servers identical to defender in the rack. These servers were empty, meaning they had no OS or software of any type installed on them. They were actually scheduled to get their OS installed before end of business today and were planned to be targets for the remaining migrations we are doing. But at this point they immediately became targets for restoring defender. Also, from a long-term standpoint we always maintain a spare server. That's just good business. You have to be prepared for the unexpected.
As part of the migrations we have changed to a new backup system (CDP from R1Soft). This system is far more efficient than the old system (Avamar) that was provided by the data center and used to back up the old servers. Nightly backups now take 30 to 45 minutes instead of 4 to 5 hours and are far less intrusive on server performance. A feature of that backup software is something called a "bare metal restore (BMR)", which basically restores a server completely from the backups. No need to install the OS, install cPanel... everything is restored. Given the (apparent) simplicity of a bare metal restore and the fact that we had empty servers available, we started down the BMR path, but since we had never done a BMR before it took a lot of starts and stops. The vendor strongly recommends testing before having to do it for real. They are absolutely right. We spent 6-7 hours trying to get everything talking to each other so the BMR could be completed. We never got it to work, and when we discovered there was a showstopper issue involving disk partition sizes, we knew we had to find another option for restoring this server. 7 more hours wasted, and to me, extremely frustrating to have wasted that time.
At that point we took the other server that was in the rack and had the data center install the OS. We installed cPanel, added the server to our backup cluster and prepared it to accept the restore data. That actually took about 6 hours, and finally we had a server that was ready for restores, but we were about 20 hours in by then and only just at the point of being able to restore data.
Now, the announcements and messages on the IVR were probably overly optimistic, and as we hit setbacks it may have looked like we did not have control of the situation. I want everyone to know that from the very beginning of this issue we have been focused on taking the fastest route to getting things back online. What that does not account for are the setbacks that can be encountered along the way.
So where are we at? The restores are in process. They are moving forward at a good rate. As for an ETA, I don't have one. There are too many variables. It could be 4 hours, it could be 12 hours. Just too many variables.
Now for some specific comments.
About the invoice being sent out. Timing is everything. For most customers that have been with us for a long time, our invoices always come out on the 1st. Just bad timing. Now, should an email about the outage have been sent out separately? Yes, it should have. No excuses.
As for getting some services up and running, such as email. That might be possible, but once you start enabling services you start stealing resources away from completing the other restores. It is usually most efficient to focus on getting everything restored and then turn up the server. Now, some further perspective. The ideal hosting environment would be distributed, where services such as email, MySQL, Apache (web) and DNS each run on their own server. That way when a server crashes you lose only one service, not all services. Now, with the server migrations we have taken one step toward a distributed hosting environment. We are using clustered DNS servers running on two other servers in our network. That is why when you put your domain in a browser it times out (domain found but not responding) instead of reporting a domain name not found. Furthermore, this DNS arrangement is allowing anyone using a 3rd party email service (e.g. Google Gmail) to continue to receive their email because the domain lookup is succeeding.
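The difference described above, between a domain that is found but not responding and a domain name that is not found, can be checked from any client machine. What follows is only a minimal Python sketch of that check, not anything WebHSP runs. The function name `check_site`, the default port 80 and the three result labels are assumptions made for illustration.

```python
import socket


def check_site(host, port=80, timeout=5.0,
               resolve=socket.getaddrinfo,
               connect=socket.create_connection):
    """Classify a host as 'dns_failure', 'no_response', or 'up'.

    resolve and connect are injectable so the logic can be exercised
    without touching the network; by default they are the standard
    library's socket functions.
    """
    try:
        # Step 1: DNS lookup. While the DNS servers are up, this step
        # succeeds even if the web server itself is down.
        resolve(host, port)
    except socket.gaierror:
        return "dns_failure"  # the "domain name not found" case

    try:
        # Step 2: TCP connection to the web server itself.
        conn = connect((host, port), timeout=timeout)
    except OSError:
        return "no_response"  # the "domain found but not responding" case

    conn.close()
    return "up"
```

Calling `check_site("example.com")` performs a real lookup and connection attempt. During an outage like this one, a hosted domain would be expected to land in the `no_response` case rather than `dns_failure`.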
I apologize for the length of this post. I hope this information helps and gives you some perspective on why things are taking so long. I am as frustrated as you are because we have never taken this long to restore a failed server.
~John Burns
|
Adel
|
 |
« Reply #25 on: July 02, 2010, 02:07:27 AM » |
|
Thank you for the update, John. A little late for me. I've told all my clients we should be coming back online during the next couple of hours - looks now like it will be much longer. It's Friday evening (6:15pm) here, so I'm hoping most will be having their Friday drinks now and forgetting about their website and email.
Now that we have a number of people using the Forum, how about we discuss possible contingency plans: I never want to go through this again.
I have discussed changing the DNS of affected clients to our other server so they could at least get their emails, but after debating the pros and cons of potential propagation delays (to and fro) with them, they decided against it.
After the move to Defender we had a few occurrences of problems with PHP, but the clients could receive their emails and I could upload a splash page for them - they were perfectly OK with that. But this situation has been a major concern for everyone.
I'm just a little old web developer, so I would appreciate any suggestions on how to reduce the impact of a situation like this in the future.
|
junaid@payandgo.biz
|
 |
« Reply #26 on: July 02, 2010, 02:18:06 AM » |
|
hi
OK, thank you for the detailed info. My customers and I are waiting for the server to be restored and back online today so we can start using email and other services. Half the business day has already passed here in high tension, and I am getting calls from my customers again and again, which is creating a big hassle. Because this issue has dragged on, my customers' trust in me has been damaged; they have been left blank and frustrated since yesterday. I am praying the server gets activated as quickly as possible and performs smoothly. A situation where all services are down should not happen in future. Last year, in April 2009, a RAID card problem kept the server off for 3 days. My customers have pointed out that in the past this server, its services and its support were efficient and reliable, but in 2009 and 2010 the quality has gone down and problems come up from time to time. Please plan a good strategy, principles and techs to avoid such things in future.
|
Tim
Newbie
Offline
Posts: 5
|
 |
« Reply #27 on: July 02, 2010, 04:52:20 AM » |
|
John,
Thanks for the details. You said "Furthermore, this DNS arrangement is allowing anyone using a 3rd party email service (i.e. google gmail) to continue to receive their email because a domain lookup is succeeding." You don't mean that this is actually a way to get email from the server, do you? I assume you mean that Google, etc. is not reporting back the connection issue, right? Email is 100% of my problem right now. The actual sites are advertising for the most part. Email is doing business. I have to have that email up in less than two hours. PLEASE do what you can to make that happen. And I would appreciate regular updates until this thing is done. On the hour, every hour. It does not matter if it is the person who mops the floors; have them post where we are in the current restore process: 50%, 60%, etc. I think you guys owe that to us.
Thanks,....
|
junaid@payandgo.biz
|
 |
« Reply #28 on: July 02, 2010, 05:26:26 AM » |
|
hi
A WebHSP person said in my ticket that the server is on now and accounts are being restored, and that all services will then be activated and the problem resolved.
At least tell me which domains I can check on the server, so I can reassure my clients that services are close to being back online and that they will be able to use email, FTP and HTTP very soon.
|
junaid@payandgo.biz
|
 |
« Reply #29 on: July 02, 2010, 06:17:22 AM » |
|
Posted on 02 Jul 2010 05:24 AM hi
At least activate email services after copying the necessary configuration files to the server, so customers can start using email. Websites, databases and so on should be the second priority to restore and activate.
Customers are waiting for their email services to be activated. Please pass my message on to the technicians and accept my request if possible. Believe me, customers are calling me continuously, and all my work is being disrupted because I have to keep watching the forum and ticket system to keep them updated and on my side in this situation.
|
elytradesign
|
 |
« Reply #30 on: July 02, 2010, 07:19:23 AM » |
|
John, thanks for the detailed explanation. It helps to know more clearly what's going on at your end. I do agree with what several others have asked - can you please provide more regular updates? I don't doubt that you are doing everything to restore the server - but we need to hear more regularly from you, or Mark or someone who can continually keep us informed. This is a trying time for all of us.
|
elytradesign
|
 |
« Reply #31 on: July 02, 2010, 10:11:07 AM » |
|
Just want to make sure that when everything goes live again that all the email that we haven't been able to receive will be restored as well. My clients are asking - Thanks.
|
WHSP-Mark M
|
 |
« Reply #32 on: July 02, 2010, 10:12:54 AM » |
|
Hello All,
I must apologize for the lack of updates on this. John and I have been working so hard on getting this up and running that I almost lost sight of who this affects.
I have talked to almost all of you at one point or another, and I must again apologize for not being more forthcoming with information, but there hasn't been much to report until now.
We now have the MySQL databases as well as all of the individual configuration files that are necessary to get cPanel fully functional. At this moment we are into the 'meat and potatoes' of the restore, and that is the actual restoration of your home directories, which contain your emails and, more importantly, your websites. This process is going to be the longest portion of the restore as it contains the most data. However, we are restoring on an alphabetical basis, so within the next hour or two clients that have a's and b's at the start of their username should start coming back up. We are easing into this slowly so we do not have everybody hit the server at the exact same moment and cause excessive slowness.
With regards to preventing this in the future: we have now mounted new 1TB hard drives as a /backup drive outside of the RAID5 configuration on each of the new servers and are currently configuring cPanel to do backups every 2nd day to this drive. With the cPanel backup utility backing up to this drive, we will be able to pop this drive out, pop it into a new server and start restoring accounts in the event of another catastrophic failure such as this.
We will also continue to run our CDP backups in conjunction with these cPanel backups, so we also have a copy of our clients' data off the server to ensure that the data is safe in the event that the backup drive fails.
At this moment we are about 30 minutes from starting to bring some websites online. We are still restoring the home directories, and a few accounts should start coming alive (we are restoring accounts in alphabetical order based on username, so a's will come online first). As a home directory restores, the account will come online for everything but email. To ensure that mail is not lost because a message tried to deliver itself to a home directory that wasn't copied over yet, we are NOT turning on the email service until all the home directories are restored. We realize this is an inconvenience, but we are trying to ensure as little data loss as possible as we bring things online.
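The ordering described above (home directories restored alphabetically by username, each account's website enabled as soon as its directory is back, email enabled only once everything is restored) can be sketched as a simple plan. This is an illustrative Python sketch only; the function name `plan_restore` and the step labels are hypothetical and do not correspond to any cPanel or R1Soft command.

```python
def plan_restore(usernames):
    """Return the ordered list of (step, target) pairs for a restore.

    Accounts are processed alphabetically by username. Web service is
    enabled per account right after its home directory is restored;
    mail is enabled once, after every home directory is in place, so no
    message is delivered into a directory that does not exist yet.
    """
    steps = []
    for user in sorted(usernames, key=str.lower):
        steps.append(("restore_home", user))
        steps.append(("enable_web", user))
    steps.append(("enable_mail", "all accounts"))
    return steps


# Hypothetical usernames, for illustration only.
for step in plan_restore(["bob", "alice"]):
    print(step)
```

Holding mail back until the end trades a longer email outage for a lower risk of losing messages during the restore.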
This is the home stretch, folks; there is light at the end of the tunnel. John, Pat and all of us here at WebHSP want to thank you all for your patience and understanding during this dark time (no pun intended).
We are committed to restoring the great service that you have come to know and appreciate and we hope that we can turn this forum into a community in which our users can grow with us.
Thank You. The WebHSP Team.
|
jann
Newbie
Offline
Posts: 12
|
 |
« Reply #33 on: July 02, 2010, 10:22:39 AM » |
|
Hi John, you got me with that report; the tech talk blurs my vision. I'm no wiser as to when you will be up or whether all the mail will be recovered. We need warp speed now, Scotty; that's why we put our TRUST in your firm. Times and answers, please.
|
junaid@payandgo.biz
|
 |
« Reply #34 on: July 02, 2010, 10:37:40 AM » |
|
hi
Any updates regarding the server? It has been a very long time, and customers are calling me again and again and are frustrated. Everything should be activated now; it is very difficult to convince customers at this point.
|
WHSP-Mark M
|
 |
« Reply #35 on: July 02, 2010, 11:10:29 AM » |
|
Hi Jan,
Basically what we mean is that websites are being restored in alphabetical order. We are currently in the B's!
To clarify on the email service: email messages that are sent during this time frame (now that the server is back online) will still be collected but will not be delivered until all the accounts are fully restored. This is to ensure that messages don't bounce back to the sender when a message is sent to a person's account that has yet to be fully restored.
I will provide more updates as we progress on the letters.
Cheers, Mark
|
jann
Newbie
Offline
Posts: 12
|
 |
« Reply #36 on: July 02, 2010, 11:14:40 AM » |
|
Sorry, but this clarification sounds equally evasive. Be straight; it affects the TRUST factor between us. That's all we really have to sell. What about messages from Wednesday and Thursday? Were they captured, and will they come through? Is that yes or no? Thank you,
« Last Edit: July 02, 2010, 11:55:27 AM by jann »
|
greeny
|
 |
« Reply #37 on: July 02, 2010, 12:05:27 PM » |
|
I am expecting that mail sent over the last day and a half will be delivered, eventually. Not because WebHSP would have captured it, but because the sending servers would have experienced what they viewed as temporary errors (couldn't establish a connection) and would have queued the mail to be retried later. Mail servers conventionally will retry for 3 days, and since it's been less than three days since the failure, queued messages should not have permanently bounced and should get redelivered, eventually.
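The reasoning above comes down to one comparison: has the outage lasted longer than the sending server's retry window? Here is a minimal Python sketch of that check. The 3-day default is the figure from this post, not a guarantee; each sending server sets its own give-up time. The function name `still_queued` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def still_queued(first_attempt, server_back, retry_window=timedelta(days=3)):
    """Return True if a sender that keeps retrying for retry_window would
    still be holding the message when the receiving server returns."""
    return server_back - first_attempt <= retry_window


# Hypothetical timestamps, for illustration only.
first_attempt = datetime(2010, 6, 30, 12, 0)

# Server back 2 days 6 hours later: inside a 3-day window.
print(still_queued(first_attempt, datetime(2010, 7, 2, 18, 0)))  # True

# Server back 4 days later: the sender would already have given up.
print(still_queued(first_attempt, datetime(2010, 7, 4, 12, 0)))  # False
```

A sender configured with a shorter window would bounce sooner, which is why delivery of mail sent during the outage cannot be promised for every message.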
|
WHSP-Mark M
|
 |
« Reply #38 on: July 02, 2010, 12:22:34 PM » |
|
Hi Jan,
I am afraid that messages sent while the server was down could not be retrieved, as they would have bounced back to the sender as undeliverable. But greeny is correct: some mail servers will queue messages and try to re-send them, so SOME messages may come through. It really depends on the configuration of the mail server that sent the message.
As another update, roughly 450 accounts have been restored. A total of around 1350 accounts were on our defender server, but the restores are going much faster than in our first restore attempt.
We are now on D and heading into E. Slowly but steadily folks!
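For anyone who wants to turn those figures into a rough percentage or time estimate, the arithmetic can be sketched as follows. This is a back-of-the-envelope Python sketch, not an official ETA. It assumes every account takes about the same time to restore, which is unlikely to hold since account sizes differ, and the 2-hour elapsed time is a hypothetical input.

```python
def restore_progress(done, total):
    """Fraction of accounts restored so far."""
    return done / total


def estimate_remaining_hours(done, total, elapsed_hours):
    """Linear extrapolation: remaining time if the per-account rate
    observed so far stays constant."""
    if done <= 0:
        raise ValueError("need at least one restored account to extrapolate")
    return elapsed_hours * (total - done) / done


# 450 of roughly 1350 accounts, as reported in this post.
print(f"{restore_progress(450, 1350):.0%}")  # 33%

# Hypothetical: if those 450 accounts had taken 2 hours to restore.
print(estimate_remaining_hours(450, 1350, 2.0))  # 4.0
```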
Cheers, Mark
« Last Edit: July 02, 2010, 12:25:07 PM by WHSP-Mark M »
|
WHSP-Mark M
|
 |
« Reply #39 on: July 02, 2010, 12:42:08 PM » |
|
Hi All,
I am recompiling PHP/Apache on the server at the moment, so you may notice your websites not displaying. This is only temporary and should clear up when the recompile finishes in about 15 minutes. This was necessary to get Zend Optimizer installed and working on the system.
Cheers, Mark