Real Life DR & BC, with VMware SRM

Well, I am back from a little excursion outside of Tokyo to recharge and get temporarily away from the reality of the situation at the nuclear reactors up North in Fukushima.

What better time than now to post an article about Disaster Recovery and Business Continuity using VMware’s Site Recovery Manager product?

Overview

As seismically active as Japan is, it would seem a functional requirement of vSphere designs to encompass DR/BC planning, so I was surprised to find out that we were one of the few companies in Japan implementing SRM in a production capacity when we started the project about 2 years ago. I suspect that the infancy of production VMware deployments here has some influence on this. I also suspect companies that have production deployments well in hand will now look toward the DR/BC benefits of this product, especially now and in the near future.

Background

Since I joined the company, going on 3 years ago, we have maintained 2 on-site datacenters (1 in Eastern and 1 in Western Japan). The main purpose up until my arrival was to provide fast access to essential services such as e-mail, file server, AD, etc., but the increase in speed of our private MPLS network between the Eastern and Western offices has made this less of an issue.

We looked at using the Western Japan datacenter as a recovery site in the event of natural disaster or other break in business continuity. At first we used a manual process to test and implement this. We were then able to plan licensing for VMware SRM and implemented it in about 4 months, encompassing critical (SLA backed) services comprising about 10 VMs.

Implementation

We ended up having 2 “Recovery Plans”, in VMware’s terminology, each of which was tested multiple times in the past 2 years. We were able to automate the entire process, save the DNS switching, which required some manual intervention during the testing phase due to our use of static DNS. The only way to get the DNS to change automatically, and hence allow the services to be available to users externally, is through the use of dynamic DNS and DHCP. We will look at this more closely going forward, as we had some issues that I will describe shortly.

We also automated the switching of the VMs’ IPs, DNS, routing, etc. using the IP address mapping CSV file described in the SRM documentation. This was tested as working until recently, when I upgraded to vCenter 4.1 and the IP address mapping configuration was inexplicably lost.
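A pre-failover check along the following lines might catch a missing or incomplete mapping before it matters. This is only a sketch in Python, not something we ran: the simplified CSV layout (columns `vm_name` and `recovery_ip`), the VM names, and the function `find_unmapped_vms` are all assumptions made for illustration, and the layout is not the format SRM itself expects.

```python
import csv
import io


def find_unmapped_vms(plan_vms, mapping_csv_text):
    """Return the VMs in a recovery plan that have no recovery-site IP.

    plan_vms: iterable of VM names protected by the recovery plan.
    mapping_csv_text: CSV text with hypothetical columns
        'vm_name' and 'recovery_ip' (NOT the real SRM file format).
    """
    reader = csv.DictReader(io.StringIO(mapping_csv_text))
    mapped = set()
    for row in reader:
        name = (row.get("vm_name") or "").strip()
        ip = (row.get("recovery_ip") or "").strip()
        # A row only counts as a mapping when both fields are filled in.
        if name and ip:
            mapped.add(name)
    return sorted(vm for vm in plan_vms if vm not in mapped)


# Hypothetical sample data: file01 has an empty IP, web01 has no row at all.
SAMPLE_MAPPING = """vm_name,recovery_ip
mail01,10.20.0.11
file01,
"""

if __name__ == "__main__":
    missing = find_unmapped_vms(["mail01", "file01", "web01"], SAMPLE_MAPPING)
    print("VMs without a recovery IP:", missing)
```

Run on a schedule, or after any upgrade, a check like this would flag both a lost mapping file and VMs that were added to a plan without an address at the recovery site.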

Finally, we use our SRM implementation in an additional, less common way. Whenever we have a scheduled power outage at one of our sites, we use SRM to carry out a controlled failover to the other site for critical external user services. Once the power outage or test is completed, we reverse the SAN replication and manually fail back the services (as VMware SRM 4.1 currently doesn’t have a procedure for this).

Disaster and Resulting Failover

On 3/11 at 2:46PM local time, our VMware DR plan immediately went into motion. Thankfully we still had access to systems at our primary site, so we tested the accessibility of services as a first line of defense. Everything was fully operational. I performed some additional tests of the recovery plans at that time, to ensure that we were able to failover to the Western office, should another quake take out the power, or remove our access. All tests checked out OK.

After the smoke cleared, we learned that rolling blackouts would be taking place in the area of our Eastern Japan office, so we made plans to perform the “Recovery” operation to the other, functional office. For the most part the operation was successful, but not without a few issues. Namely, due to the lack of IP address mappings, we had some VMs for which the IP address did not switch over automatically. This, in combination with the fact that we didn’t document some of the DNS changes in the appropriate zone files, meant that we had to figure out the IP addresses manually and add entries to the zone file instead of commenting/uncommenting the appropriate entries as we had done during previous testing. Also, it turned out that without my knowledge, some additional VMs were added to the Recovery Plan without having IPs reserved in the Recovery Site. We had to work dynamically to account for these last-minute changes.
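The commenting/uncommenting of zone file entries is the kind of step that lends itself to a small script. The Python sketch below shows one way it could work, under assumptions that are mine and not part of our actual setup: every site-specific record carries a trailing `; site=<name>` tag, each tagged record starts with an explicit owner name, and inactive records are commented out with a leading `;`. The function name `switch_site` and the sample records are hypothetical.

```python
import re

# Hypothetical convention: site-specific records end with "; site=<name>".
SITE_TAG = re.compile(r";\s*site=(\w+)\s*$")


def switch_site(zone_text, active_site):
    """Enable records tagged with active_site and comment out other sites."""
    out = []
    for line in zone_text.splitlines():
        match = SITE_TAG.search(line)
        if match is None:
            # Untagged lines (SOA, NS, shared records) pass through unchanged.
            out.append(line)
            continue
        # Strip any existing leading comment marker to get the bare record.
        record = re.sub(r"^;+\s?", "", line)
        if match.group(1) == active_site:
            out.append(record)
        else:
            out.append("; " + record)
    return "\n".join(out) + "\n"


# Hypothetical zone fragment: mail is served from the east site by default.
ZONE = """mail    IN A 10.10.0.11 ; site=east
; mail    IN A 10.20.0.11 ; site=west
www     IN A 10.10.0.12
"""

if __name__ == "__main__":
    print(switch_site(ZONE, "west"))
```

Because both sites' addresses sit in the file ahead of time, an undocumented change shows up as a record with no counterpart for the other site rather than as a surprise during failover. On DNS servers that rely on zone transfers, the SOA serial would also need to be incremented, which this sketch leaves out.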

All in all, we got the issues resolved manually and restored services to users within an acceptable timeframe. Thankfully this took place over the weekend when the end users are out of the office. The problems encountered will serve as lessons for better planning for future DR plans and will prompt management for more frequent DR tests.

I hope that the relative success of the running of the DR plan will also open the eyes of management to more prevalent use of VMware, specifically for the disaster recovery benefits. I can say, without going into detail, other critical systems (outside of the virtual infrastructure) did not fare as well!

Thanks VMware! Thanks SRM! Mission accomplished…


  • Scott

    Great post. One question, though. Before the disaster hit, you said you were using SRM for planned maintenance. You did, however, quickly go over the details of "manually failing back". Do you think you could elaborate a bit?

  • Christopher Wells

    @Scott

    Thanks for the comment. Yes, indeed… we use the product to move VMs between datacenters during times of planned or power maintenance in our 2 buildings here. We have the SAN replication (using NetApp's SnapMirror in this case) replicating from our Eastern Japan to our Western Japan office.

    Annually there is a power check at each of our buildings, so we will plan a controlled move of the pertinent VMs from one site to the other before this takes place. Once the maintenance is completed, we reverse the replication by performing a re-sync operation. Once the re-sync is completed, we manually shut down the VMs in the Recovery Site, sync any final changes, and then power on the VMs in the Protected Site again.

    I hope this explains adequately. If not, I would be happy to post a follow-up with the exact steps on the filer that were performed, if that would be helpful to you.

  • Michael White

    Hi Christopher,

    I am from VMware, and I work on the SRM team. It is great to hear that things worked out for you. I wonder if you would provide me with more detail on what happened and what went wrong so we can learn from this and see what we can improve?

    Michael

  • IP

    When you upgraded to 4.1 and you lost your csv file, what did you do? and where did you create the mapping, on the recovery SRM server or protected site ?

  • Christopher Wells

    @IP

    The mappings were created in the Recovery Site, if I'm not mistaken. I wasn't able to upgrade from 4.0 to 4.1 of vCenter, so I did a fresh install and pointed to the pre-existing vCenter database (separate server). I think I had to do the same with SRM (it's been a while; can't say for sure). Perhaps I didn't re-point the new SRM configuration to the IP Address Mapping .csv after the re-install. Human error is the most likely explanation.

    It would be nice if there was an interface in the recovery plan to enter these IPs, DNS, routing, etc. as opposed to the less-than-user-friendly CSV. For the future, I will look to dynamic DNS and DHCP to remedy these issues.

  • Tristan Todd

    I have had very good experiences with SRM deployments where DHCP and dynamic DNS are implemented. If you can't do stretch-VLANS, it's a great way to go.

  • Christopher Wells

    @Tristan

    Now that things have settled a bit over here, I am going to discuss the possibility of using these technologies to automate things a bit more. This will allow our DR/BC executive management team to execute the DR plan with minimal communication to IT.

    Thanks for the comment!

  • Leon

    Hi Christopher

    Fantastic article ! Recent events are certainly a tragedy however a positive may be that in future years Japan will be the leaders in IT disaster recovery. I have only installed SRM in a test environment but it looks very good.

    Have you personally considered leaving Japan due to recent events? I live in Australia and my wife is Japanese, we will be travelling to Japan in 2 weeks with my 6 month old…I am a little concerned. Her friends in Japan have said many expats have left Japan.

    Cheers
    Leon

  • Christopher Wells

    @Leon

    I have lots of ideas floating around in my head at the moment as to whether to stay or whether to go. I don't have the best Japanese language skills either, to be honest, so that's playing into my decision.

    If you have any doubts about visiting Japan, I wouldn't worry about that in the slightest. Everything is getting back to normal for everyday life. I only wonder about the long term effects to the economy, pension system, taxes, etc.

    Thanks for your comment!

