Archive | Disaster Recovery

Coho Data at the Tokyo VMware vForum


Those readers who know me personally know that I lived in Japan for almost 7 years (nearly half of my IT career). I’ve blogged a fair bit in Japanese, and about some of the events around the 2011 Tōhoku earthquake and tsunami here & here. Needless to say, Japan holds a special place in my heart, and it’s always great to get back for a visit.

It’s been close to 3 years since I was last back, and I’m excited to visit the Far East again to see what’s new and what’s changed in enterprise IT in the years since. I suspect it’s virtually unrecognizable from when I lived there from 2004-2011.

In those days, VMware wasn’t used nearly as heavily in Japan as it is today. Not to mention that Business Continuity and Disaster Recovery were merely an afterthought before the 2011 earthquake, despite the fact that Japan is one of the most seismically active countries on the planet!

Add to this the fact that the storage market has changed tremendously since then. Back then there were really only two major players, EMC and NetApp, and none of the multitude of next-gen storage vendors vying for market share today. Think about that for a minute: no Nimble Storage, no Pure Storage, no SolidFire, no Tegile, no Tintri, no Coho Data, and too many more to name. The new players are being disruptive and taking market share away from the old guard of EMC and NetApp. We all have our strengths and weaknesses, key use cases, and varying costs, but one thing rings true: the next-gen storage vendors are innovating at a pace the legacy vendors are struggling to match!

These are just a couple of reasons why I am so excited for Coho Data to be a part of the vForum in Tokyo this year. I was an attendee for 2-3 years while I lived there, but I suspect a whole lot has changed since then, especially looking at things from the vendor point-of-view. I’m really looking forward to disrupting the storage market in Japan, much like we’re doing here in North America.





Implementing Site-to-Site Replication with Coho SiteProtect

Now that I’ve given you a quick overview of the architecture of Coho SiteProtect, I’d like to provide you with the basics for implementing SiteProtect in your data center. This is the second in my series of posts on our site-to-site replication offering. As I discover the best practices for deploying SiteProtect in various infrastructures and scenarios, I’ll document those here as well, so stay tuned for those…

Without further ado, here is the step-by-step set-up procedure for SiteProtect…

Pairing the Sites

The first step in setting up remote replication is establishing a trusted relationship from the local site to the remote site. This is done from the Settings > Replication page in the Coho web UI, indicated by the gear (settings) icon (Figure 1).


Figure 1: Settings > Replication page

From here, click the “Begin replication setup” link which brings you to the configuration screen for the local site (Figure 2).


Figure 2: Settings > Replication > Local Site page

Here, you’ll specify the network settings for the site-to-site communication. It is worth noting that the replication traffic is sent on a VLAN to simplify network management for enterprise environments.

Here you can also configure bandwidth throttling for outbound traffic in case you need to limit usage of the site-to-site interconnect. The same can be done on the remote site, which means that both incoming and outgoing throughput can be controlled. Bear in mind that by limiting the traffic, you may increase the time it takes for a workload to finish replicating, in other words, increase the effective RPO.
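To make that trade-off concrete, here is a back-of-the-envelope sketch (my own illustration, not a Coho tool, and the link and throttle figures are hypothetical) of how a bandwidth cap stretches the time a replication cycle needs:

```python
# Will a replication cycle finish in a reasonable time under a bandwidth cap?
def replication_time_seconds(changed_bytes, link_mbps, throttle_mbps=None):
    """Estimate transfer time for the changed data, ignoring protocol
    overhead and assuming the throttle (if set) is the bottleneck."""
    effective_mbps = min(link_mbps, throttle_mbps) if throttle_mbps else link_mbps
    bits = changed_bytes * 8
    return bits / (effective_mbps * 1_000_000)

# Example: 2 GiB of changed data on a 1 Gbps link throttled to 100 Mbps
t = replication_time_seconds(2 * 1024**3, link_mbps=1000, throttle_mbps=100)
print(f"{t/60:.1f} minutes")  # 2.9 minutes
```

If that estimate creeps up toward your snapshot interval, either relax the throttle or lengthen the schedule.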

Once that’s complete, you’ll click “Next” and specify the IP and password of the remote DataStream. Click “Next” again to proceed (Figure 3).


Figure 3: Settings > Replication > Remote Credentials page

Once the wizard confirms a connection to the other side, you’ll specify the remote system’s VLAN, replication IP address, and netmask, as well as the default gateway for the other side and click “Next” (Figure 4).

Note: On this page the bandwidth limit relates to outbound traffic from the remote site; or put another way, the inbound replication traffic arriving at the local site.


Figure 4: Settings > Replication > Remote Network page

Finally, you’re brought to step 4, the “Summary” page, which allows you to review the configuration before applying the settings. Click “Apply and Connect” to complete the wizard (Figure 5).


Figure 5: Settings > Replication > Summary page

From this point forward, you’ll be presented with the following view when you go to the Settings > Replication page. You can see here (Figure 6), the IP of the remote node and that replication is active.


Figure 6: Settings > Replication page (completed)

Configuring Workloads and Schedules

Now that the initial pairing is complete, you’ll visit the “Snapshots and Replication” page to customize which workloads are replicated as well as the snapshot & replication interval for each (Figure 7).


Figure 7: Snapshots / Replication > Overview page

Here (Figure 7), we provide an overview of the workloads. This is a dashboard which tells us the number of VMs with snapshots as well as replicated snapshots. For all of a site’s workloads to be protected, they should all have replicated snapshots, ensuring that any of those workloads can be recovered on the remote site in the event of a disaster.

We also provide a summary of the workloads covered by replication, how many bytes have been transferred, and the average replication time. These statistics provide assurance that replication is functional and show the rate of change of the data, allowing you to determine whether your replication interval is appropriate for the bandwidth you have available. If your average replication time is greater than your snapshot interval, replication can’t keep up, and you should lengthen the interval (or raise the bandwidth limit) accordingly.

To configure or modify workloads, proceed to the “Workloads” page (Figure 8).


Figure 8: Snapshots and Replication > Workloads page

Here (Figure 8), we denote the local vs. the remote workloads, provide a record of when the last snapshot was taken, and display the assigned schedule.

Note: VMs which have been deleted are denoted with a strike through the name.

Under “Snapshot Record”, you can click on the calendar icon to view snapshot date, name and description, as well as the status of replication. In this example, we have recently enabled the workload for replication denoted by the word “Scheduled” (Figure 9).


Figure 9: Snapshots and Replication > Workloads > Snapshot Record page

To manually protect a specific workload, click the camera icon next to that workload. This will allow you to take a manual snapshot and replicate that snapshot (Figure 10).


Figure 10: Snapshots and Replication > Workloads > Snapshot page

Most users will want to protect a number of VMs at once. The best way to do this is from the “Default Schedule” page (Figure 11).


Figure 11: Snapshots and Replication > Default Schedule page

In this example we have selected an RPO of 15 minutes by replicating the snapshot every 15 minutes. The frequency of snapshots is best determined by the needs of the application, and Coho’s automated snapshot schedule offers flexibility, from minutes to months.

Note: Quiescing puts the system in an application-consistent state before taking the snapshot; however, this is only available in the daily and weekly schedules. Taking quiesced snapshots more frequently can cause significant performance penalties. These penalties are not related to the Coho storage but to how quiesced snapshots are executed within the VMware environment. A crash-consistent snapshot (no quiesce) can be taken very frequently on the Coho storage without a performance penalty.
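One way to sanity-check a schedule like the 15-minute example above is the rough rule that worst-case data loss is about one interval plus the transfer time. This is my own illustration of that arithmetic, not Coho’s exact semantics:

```python
# Rough worst-case recovery-point estimate: data written just after a
# snapshot is only safe remotely once the *next* snapshot is taken
# (up to one interval later) and has finished replicating.
def worst_case_rpo_minutes(interval_min, avg_replication_min):
    if avg_replication_min > interval_min:
        # Transfers would overlap; the schedule can't keep up.
        raise ValueError("replication slower than interval; lengthen the schedule")
    return interval_min + avg_replication_min

print(worst_case_rpo_minutes(15, 3))  # 18
```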


In the event of a disaster, you’ll want to be able to bring up your applications in the remote site. This is done from the “Failover/Failback” view (Figure 12).


Figure 12: Snapshots and Replication > Failover/Failback page

Initially, failover and failback are disabled in order to protect you from instantiating multiple copies of the same VM. You make the decision (from either location) to put the disaster recovery plan in-motion. If you’re ready to proceed, click the “Enable” button to enable failover (Figure 13).


Figure 13: Snapshots and Replication > Failover/Failback page (enabled)

You can now go to the remote DataStream and clone your replicated workloads to the remote system. Open up the web UI of the remote DataStream and, again, go to the Snapshots and Replication > Workloads page (Figure 14).


Figure 14: Snapshots and Replication > Workloads page (remote)

Click the “Remote Workloads” checkbox to filter by those workloads. These are the workloads available for failover from the primary to the disaster site. Choose the workload by clicking the calendar icon. Browse the recent snapshots and choose one to clone from, by clicking the clone icon (Figure 15).


Figure 15: Snapshots and Replication > Workloads page (failover)

Once you’ve selected the desired snapshot, enter a VM name and choose a target vSphere host. Click “Clone” to clone it and recover it to the destination site. The workload is now failed-over to continue serving data to your users. Just power it on in vCenter and you’re ready to go.


If at some point the primary site comes back online, we support failing workloads back to their original location. This is done from the Snapshots and Replication page. On the workload that you’d like to fail back (Figure 16), click the calendar icon to view the available snapshots, then click the red arrow to sync the snapshot to the original VM. Once the VM is powered on, your app will be back in the original location with all of the changed data from snapshots replicated from the remote site since the failure occurred; simple and easy, just like it should be.


Figure 16: Snapshots and Replication > Workloads page (failback)

Well, that’s it for the initial implementation. As you can see, Coho SiteProtect is easy to get set up and configured in any environment. Next, we’ll dive into some best practices for configuring SiteProtect for optimal performance in environments of various sizes and requirements.

Until then, if you’d like more info about Coho SiteProtect, click here!



Introducing Site-to-Site Replication with Coho SiteProtect


While our engineers have been hard at work preparing the bits for our site-to-site replication offering, I have been testing the technology in preparation for a slew of technical collateral on the feature. In addition to introducing Coho SiteProtect here on my blog, I want to share with you a quick overview of the architecture. You can find more on this feature at the Coho Data blog here and here. Stay tuned for more on this topic in future posts!

Replication is something I am extremely passionate about, and I’m very happy to talk about it with anyone who’s interested. I’ve witnessed firsthand what a solid DR plan can mean to a business, and today more than ever, businesses rely on it to deliver data to their customers under any circumstances, both predictable and unplanned.

Now, let’s dive into the architecture…

Coho’s SiteProtect replication implementation reflects the unique features of our patented scale-out system architecture. The two most notable elements of SiteProtect are dynamic data replication and lightweight snapshots.

Dynamic Data Replication

For Coho, replication is a core architectural pillar that not only replaces technologies like RAID for data protection, but is also used to scale out the capacity of your cluster when you add nodes and to re-balance data across those nodes in times of congestion. Additionally, we use replication when decommissioning nodes, or during a node failure to rebuild a replica of the data on the surviving nodes. Because we replicate objects in the Coho Bare Metal Object Store, we can do this virtually at the block level as new files are created or old files are modified. We keep the data synchronously updated so that workloads never skip a beat.

For data availability in the event of a disaster, we have extended this functionality to other clusters at remote sites. Because distance typically introduces latency and bandwidth challenges, we shift to an asynchronous approach for remote replicas. This prevents the performance issues you would see if primary workloads had to compete with synchronous replication traffic, not to mention saturating your network links.
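The latency argument for going asynchronous over distance can be sketched in a couple of lines (illustrative figures only; the 0.5 ms and 40 ms values are assumptions, not Coho measurements):

```python
# A synchronous write can't be acknowledged until the remote copy is
# durable, so the WAN round trip is added to every single write.
def sync_write_latency_ms(local_ms, wan_rtt_ms):
    return local_ms + wan_rtt_ms

# An asynchronous write acknowledges locally; the remote copy catches
# up in the background, and the replication interval absorbs the lag.
def async_write_latency_ms(local_ms, wan_rtt_ms):
    return local_ms

print(sync_write_latency_ms(0.5, 40))   # 40.5 ms per write over a 40 ms WAN
print(async_write_latency_ms(0.5, 40))  # 0.5 ms
```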

Lightweight Snapshots

Our snapshot implementation leverages copy-on-write clones of the original VMs. That means the storage capacity consumed is proportional to the amount of data changed since the previous snapshot. The DataStream replicates snapshots at regular, user-selected intervals, so each data transfer sends only the changes since the previous one. Add to this the fact that we compress the data over the wire, and you’ll see a significant reduction in bandwidth usage.
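As a concrete (and hypothetical) illustration of why incremental, compressed transfers stay small, assume 4 KiB blocks and a 2:1 wire compression ratio; neither figure is a published Coho number:

```python
# Only blocks changed since the last snapshot cross the wire, and they
# are compressed in flight; after the initial sync, the full size of
# the VM is irrelevant to per-interval bandwidth.
def transfer_bytes(changed_blocks, block_size=4096, compression_ratio=0.5):
    """compression_ratio is compressed/original, e.g. 0.5 for 2:1."""
    return int(changed_blocks * block_size * compression_ratio)

# 50,000 changed 4 KiB blocks (~195 MiB of raw change) ship as ~98 MiB.
print(transfer_bytes(50_000) / 1024**2)  # 97.65625
```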

The real-world benefit is alignment with each application’s Recovery Point Objective (RPO) needs. Intervals can range from a few minutes to days or weeks; Coho SiteProtect does not force you into a one-size-fits-all model.

Failover & Failback

To recover workloads, you simply clone the replicated copy into vCenter at the remote site. It will immediately inherit the original snapshot/replication schedule, providing the ability to failback when the original site comes back online. This provides a Recovery Time Objective (RTO) in the order of seconds for your critical workloads. If the workload already exists in vCenter, we will simply update the storage configuration to reflect the latest replicated snapshot. If you want to run on an older snapshot you can do that as well.

DR Testing

Finally, while a good disaster recovery plan is important, testing replicated data isn’t always easy. Replicated snapshots are immutable, and a simple clone of a snapshot can be used for DR testing. The clone can safely be discarded after testing has completed.

Key Benefits

  • Asynchronous, snapshot-based – provides fast recovery
  • Active/Active sites – delivers efficiency
  • Granularity at the virtual machine – provides control
  • SSL data transport – ensures security of your data
  • Replicate only changed data – bandwidth efficient
  • Compression – Reduced bandwidth usage


For more information on Coho SiteProtect, click here!



Coho Site-to-Site Replication


Anyone who knows me well knows that I have a unique history with Business Continuity and Disaster Recovery software, as well as its practical use in business. In case you didn’t know, I used VMware SRM along with NetApp storage to recover from the 2011 Tōhoku earthquake and tsunami in Japan. In disaster-prone regions it is increasingly imperative, and now also relatively common, for companies to demand that their storage products support remote replication. Companies use it for simple cases such as shipping backups off-site, for the more complex use case of disaster recovery, or both. It’s extremely convenient and easy to use when implemented properly.

That said, a common request from customers here at Coho has been some form of site-to-site replication. We definitely didn’t design this as a “me too” feature; the underlying technology has been built into the product since day one. We use synchronous replication, instead of technologies like RAID, to store redundant copies of data across the independent backend storage nodes, and we leverage an asynchronous version of the same mechanism for remote replication. A lot of thought went into the other components of our implementation to make it easy to use and fully enterprise-featured from the get-go.

Here are some of the key features of our replication release:

  • Asynchronous, periodic, snapshot-based replication
  • Active – Active site support
  • Virtual Machine granularity
  • Encryption
  • Compression
  • Bandwidth throttling
  • Simple UI with one-time setup and very easy configuration
  • Flexible replication schedule

* I’ll also add to this list that we’ll be introducing support for VMware’s SRM via an SRA (Storage Replication Adapter) in the very near future, so be on the lookout for more information here and elsewhere on that.

If you’d like to read more details about the release, head over to the official Coho blog and read Doug Fallstrom’s (Sr. Director of Product Management) post on this topic.

Speaking from experience, site-to-site replication gives us another must-have enterprise grade feature, further solidifying Coho’s place at the cutting edge of new storage technologies… and this is only just the beginning!



Japan Earthquake Aftermath – Revisited

Today I am feeling a bit of the same emotion that I felt in the months after the great earthquake that my family and I, along with the people of Japan, experienced just over a year ago. The reason for this is something that I saw posted when I woke up this morning regarding how SoftBank, with the help of NetApp, helped Japan and its citizens in the recovery effort. See video

My experience, which I talked about here, mirrors some of the same things referred to by SoftBank. Imagine not being able to travel to the office because transportation was completely shut down. Or rolling blackouts meaning that, even if the trains happened to be running in the morning, you didn’t know whether they would still be running when it was time to return home. Or not having heat or cooling in the office to be able to work comfortably. Or food and any number of other basic necessities not being available at the convenience store. Or gas to drive your car…

Even though cloud was not intended for the purpose of “working from home”, as with any disruptive and innovative technology, it is interesting to see what technology is capable of in times of need or times of urgency or even new ways to use a technology that were never dreamed of before. Think about the pure energy-saving aspects of virtualization and cloud and expand that to energy savings from working remotely and you can clearly see what’s possible.

I do remember hearing stories of how SoftBank was giving away its services and donating money to help in the recovery. I commend them for their efforts. I can say that I am even more proud that they happened to do it using NetApp technologies.

As you may know from reading my blog entry from a year ago… I had the experience, as a customer, of using technologies from NetApp and VMware to fail over critical services in our infrastructure during the disaster. This only scratched the surface of what transformative technologies can accomplish, as evidenced by SoftBank’s success. Look to the past and present and you will see further evidence of how technology saves lives. Look to the future and you can see the potential and the promise of cloud computing. Exciting times indeed!


Executive Summary (PDF)

Technical Case Study (PDF)

The story behind SoftBank’s Epic Story

Dave Hitz’s blog

Val Bercovici’s blog


