
A Guide to the Recent Salesforce Disruption

By: Rob Jordan

As some of you already know, Salesforce.com experienced a data loss on the NA14 instance. They issued a press release that might be self-explanatory for those well-versed in the industry lingo, but for the rest of us, it is a bit challenging to understand. With a little help from Lucille Ball, who always has some explaining to do, we’ve translated the press release into bite-sized pieces.

 

Salesforce:

At approximately 5:47 p.m. PDT on May 9, 2016 (00:47 UTC on May 10, 2016), the Salesforce Technology team observed a service disruption for the NA14 instance.

 

Translation:

A circuit blew, causing the system to stop operating correctly.

 

Salesforce:

In an effort to restore service to the NA14 instance as quickly as possible, the team determined that the appropriate action was to perform a site switch of the instance from its primary data center (WAS) to its secondary data center in Chicago (CHI).

 

Translation:

The disruption affected some clients’ data. Rather than leave that data in the troubled WAS data center, Salesforce decided to move it to CHI.

 

Salesforce:

At approximately 5:41 a.m. PDT on May 10, 2016, the Technology team observed a degradation in performance on the NA14 instance.

 

Translation:

Even after the data was moved to CHI, some clients still couldn’t access their data.

 

Salesforce:

The team engaged our database vendor who determined that the database cluster failure had resulted in file discrepancies in the NA14 database.

 

Translation:

The best way to describe this is to picture Lucille Ball packing chocolates into wrappers, where the chocolates represent the data. The data was coming through so quickly that the system was unable to copy and process it. Lucy and Ethel simply couldn't get the chocolates into the wrappers fast enough! The result was chocolates getting sent out without their wrappers, or in other words, users unable to access their data.

 

Salesforce:

The teams pursued several approaches for restoring service to NA14 in the CHI data center, including repairing the file discrepancies directly. Each attempt to restore service resulted in errors or failures that prevented these approaches from continuing.

 

Translation:

We tried to get the chocolates back in the wrappers and it wasn’t working.

 

Salesforce:

The storage array for NA14 continued to run in a degraded state through May 11, 2016. To prevent the volume of customer activity on NA14 from negatively impacting performance due to the backlog of built up jobs and new customer activity, the team halted several internal jobs, including sandbox copies and weekly exports, and staggered initial customer activity coming into the instance.

 

Translation:

Rather than pick up all the wrappers off the floor and get the old chocolates wrapped, we decided it was best to focus on wrapping new chocolates. We did not let the unwrapped chocolates go to the consumer. The chocolate stoppage was only temporary.

 

Salesforce:

The root cause of the initial power failure to the WAS data center was a failure of a main critical board that tripped open a circuit breaker pair.  The breakers are used to segment power from the data center universal power supply ring and direct the power into the different rooms. This failed board caused a portion of the power distribution system to enter a fault condition. The fault created an uncertain power condition, which led to a redundant breaker not closing to activate the backup feed because that electronic circuit breaker could not confirm the state of the problem board.

 

Translation:

The conveyor belt on the WAS chocolate factory broke and the CHI factory did not have enough resources to handle the extra chocolates/data.

 

Salesforce:

Our vendor responsible for the WAS data center power circuits has replaced the failed components.

 

Our internal data center operations team is performing a full audit of power and failover systems in all data centers to ensure power distribution failover will correctly respond in the event of similar failures in the future.

 

Translation:

We fixed the breaker in the WAS factory, and we have upped the capacity in our CHI factory to handle more chocolates. Don't worry! None of our other factories have the same issues as the CHI factory. Finally, we are going to make some adjustments to how we respond to chocolate factory problems in the future.

 

Salesforce:

We sincerely apologize for the impact this incident caused you and your organization. It is our goal to provide world-class service to our customers, and we are continuously assessing and improving our tools, processes, and architecture in order to provide customers with the best service possible.

 

Translation:

We're sorry things turned out the way they did, and we'll make sure it doesn't happen again.

 

What it all means for you

 

Unless you received a notice that your data was affected by the outage, it likely had little impact on you. You can rest assured that Salesforce is making a list and checking it twice to ensure that this won't happen to anyone else.

 

If your data was affected, you were probably contacted. The outage was short enough that hopefully not too much was impacted, but if you have lingering problems, reach out to your Salesforce AE.

 

For more info on how to safely back up your Salesforce instance, check out the blog below:

 

BABY GOT BACKUP
