Unexpected downtime: 1 August 2016

Update: As of 9:20 AM GMT, 1 Aug 2016, all services are up and running normally. Thanks for your patience!

Today, because of a bug in our server cleanup system, a few instances suffered data loss. As a best practice to ensure data integrity, we have decided to roll back to a backup from a few hours ago. This migration will be over within 2 hours (by 8:30 AM GMT, 1 Aug 2016), and all services will then be back to normal.

Meanwhile, we’ll be online on chat and will reply to any emails if you need more information.

Some of software’s darkest failures from recent history

No matter how big or small a development team is, no software is without bugs. Bugs can be minimized but not eliminated. There will always be a case that nobody on the team ever thought of, one that presents itself at the worst possible time.

You don’t see it coming – you’re shocked to see something’s wrong but can’t say what or where. It doesn’t make sense.

We’ve all been in such situations before: they stay mysterious until we find the cause behind them.

Sometimes these bugs are the result of rare scenarios that would have been impossible to foresee. Such rare events can also be called ‘Black Swan events’.

A Black Swan event refers to a highly improbable and impactful occurrence which is impossible to predict. Its shock value causes astonishment and disbelief because people never imagined that such an event could occur. The concept of black swan events was introduced by the writer Nassim Nicholas Taleb in his book, ‘The Black Swan: The Impact of the Highly Improbable’.

Taleb mostly focuses on the ideas of fragility and antifragility in financial, medical, economic, and political systems. The Black Swan concept dates back several centuries and was based on the mistaken belief that all swans are white. However, in the 17th century, black swans were discovered in Australia, exposing the limitations of our information and imagination.

Taleb’s book doesn’t discuss how to apply the Black Swan concept to software systems, but there are some valuable lessons the testing community can draw from it when it comes to testing the performance of software systems.

Software failures result from a variety of causes – mistakes are made during coding, and undetected bugs can lie dormant for a long time before causing failures.

These are some catastrophic failures caused by software bugs that nobody anticipated:

1. Ariane 5, 1996

On June 4, 1996, the Ariane 5 rocket, which was scheduled to put four Cluster scientific satellites into space, exploded just after lift-off. The European Space Agency had spent over a decade developing the $7 billion rocket and preparing it for its first voyage. The total destruction was valued at $500 million. The inquiry board’s investigation report attributed the failure to a software error in the inertial reference system.
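
The inquiry report traced that error to an unprotected conversion of a 64-bit floating-point value into a 16-bit signed integer, which overflowed. As a rough illustration of what guarding such a conversion can look like, here is a minimal Python sketch. The function names are hypothetical, and this is not the flight code (which was written in Ada); it only shows the difference between rejecting an out-of-range value and clamping it.

```python
# Bounds of a signed 16-bit integer.
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    n = int(value)  # truncates toward zero
    if not INT16_MIN <= n <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a signed 16-bit integer")
    return n


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))
```

Whether to raise or to clamp is a design decision that depends on the system: raising makes the problem visible immediately, while clamping keeps the system running with a degraded value.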


2. NASA’s Mars Climate Orbiter, 1999

On September 23, 1999, NASA’s Mars Climate Orbiter was destroyed in space because engineers failed to make a simple conversion from English units to the metric units mandated by NASA. As a result, the spacecraft dipped too low into the Martian atmosphere, where it could not withstand the stresses inflicted on it. The stresses crippled it, and the spacecraft hurtled on through space into an orbit around the sun.
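
The mismatch is widely reported to have involved thruster impulse values produced in pound-force seconds and consumed as newton-seconds. One common defence is to keep every quantity in a single canonical unit and force callers to state the unit they are supplying. The sketch below is only an illustration of that idea; the `Impulse` class and its method names are hypothetical and have nothing to do with NASA’s actual software.

```python
from dataclasses import dataclass

# 1 pound-force = 4.4482216152605 newtons (exact by definition).
LBF_TO_NEWTON = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse, always stored internally in newton-seconds."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The conversion happens once, at the boundary where data enters.
        return cls(value * LBF_TO_NEWTON)


def total_impulse(burns):
    """Sum impulses; safe because every item is already in newton-seconds."""
    return Impulse(sum(b.newton_seconds for b in burns))


total = total_impulse([
    Impulse.from_pound_force_seconds(10.0),
    Impulse.from_newton_seconds(5.5),
])
```

Because a bare number can no longer be passed around as an impulse, a forgotten conversion shows up as a type error rather than as a silently wrong trajectory.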

3. American Power Grid collapse, 2003

On August 14, 2003, areas of the Northeastern United States and Southeastern Canada experienced widespread power failures that resulted in the shutdown of nuclear power plants and disruption of air traffic in the affected areas. More than 50 million people were affected, and the disruption cost was $13 billion. Investigators found that a maintenance engineer forgot to turn on a trigger that launched the state estimator at a set interval. Incidentally, a software program that raises an alarm about an untoward incident also failed. To make matters worse, the backup server failed as well, and as a result the cascading line failures went unnoticed until the entire transmission system collapsed.

4. AT&T Telephone System, 1990

On January 15, 1990, AT&T’s long-distance telephone switching system collapsed. Sixty thousand people lost their telecommunication service, while around seventy million telephone calls went uncompleted. The crash started in a single switching station in Manhattan. However, it spread from station to station across America until half of AT&T’s network had gone haywire. On investigation, AT&T’s software engineers found that the crash was caused by a bug in AT&T’s own software, which enables switches to sync up with each other. The bug caused miscommunication between the switches, and as a result the entire network collapsed.

5. Iran Nuclear Plant, 2012

Undetected vulnerabilities in software have paved the way for damaging cyber-attacks as well. An insider in an Iranian nuclear facility used a USB pen drive containing the Stuxnet worm. The worm replicated itself from one machine to another and ultimately crippled the control systems in the nuclear plant, thereby sabotaging the entire Iranian nuclear program.

These case studies suggest that no matter how much we estimate and plan, unexpected events can throw these plans into chaos. Merely hunting for black swans will not suffice, as we are living in an ever-changing world full of uncertainty and human fallacies.

There are also a number of teams that work on minimizing the probability of such events, and these jobs are growing in number as we begin to rely more on technology. Here’s what they do:

Risk assessment

The journey towards a robust and resilient application starts with understanding how software can be fragile, and how to strengthen it to overcome that fragility.

Failure Brainstorming

Some large companies also have separate teams for Failure Mode and Effects Analysis (FMEA), which is used to identify risks and points of failure within as well as outside the boundaries of the present environment. FMEA requires the right people from different domains to use brainstorming and lateral thinking to identify all the components, modules, dependencies, and limitations that could fail in a production environment and eventually lead to system collapse.
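
A common FMEA convention (not described in this post, so treat it as an assumption here) is to score each failure mode for severity, occurrence, and detection on a 1 to 10 scale and multiply the three into a Risk Priority Number (RPN). A minimal sketch of ranking brainstormed failure modes this way, with made-up names and scores:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (almost certain)
    detection: int   # 1 (always detected) .. 10 (practically undetectable)

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_by_risk(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical output of a brainstorming session; the scores are invented.
modes = [
    FailureMode("backup server unavailable", 9, 2, 6),
    FailureMode("alarm process hangs silently", 8, 3, 9),
    FailureMode("disk fills up on log volume", 5, 6, 2),
]

for mode in rank_by_risk(modes):
    print(f"{mode.rpn:4d}  {mode.name}")
```

The ranking is only as good as the scores the team agrees on, which is why the brainstorming step matters more than the arithmetic.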

Survival Strategy

Knowing that technology will fail is the first motivation to prepare survival strategies. This means being prepared for failure and having alternative methods in place to contain the damage.

Fault-Tolerant System Design

A lot of research is going into designing fault-tolerant systems. Depending on the domain, these systems can either shut themselves off until the issue has been rectified or switch to alternate ‘safe systems’ until they are fixed by manual intervention.

I guess these types of tasks cannot be done in typical office hours under a deadline. The level of intuition and intelligence required for such brainstorming is off the charts and may be beyond many of us. Thinking about all of it reminds me of the following quote from Einstein:
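
The second option, switching to a ‘safe system’, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a production pattern: the function names and the pricing example are hypothetical, and a real system would also need timeouts, health checks, and a way to switch back.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("fault_tolerance")


def run_with_safe_fallback(primary: Callable[[], T], safe_mode: Callable[[], T]) -> T:
    """Try the primary path; if it fails, log the failure and use the safe path."""
    try:
        return primary()
    except Exception:
        # Record the failure so that someone can intervene manually later.
        log.exception("primary path failed; switching to safe mode")
        return safe_mode()


# Hypothetical example: a live lookup that fails and a cached fallback.
def read_live_price() -> float:
    raise ConnectionError("pricing service unreachable")


def read_cached_price() -> float:
    return 9.99


price = run_with_safe_fallback(read_live_price, read_cached_price)
```

The important property is that the failure is contained and recorded rather than hidden: the caller still gets an answer, and the log tells operators that the system is running in degraded mode.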

“We cannot solve our problems with the same level of thinking that created them”

5 impactful questions a test management tool will answer in your team

Imagine you’ve joined a new organization as a QA manager and you see a great product, one you think could be the next big thing.

Sure, your new company isn’t where it could be right now, but it has huge potential.


So as a QA manager, or anyone in charge of quality for that matter, where will you begin your job?

In most cases, companies hire staff for testing and leave them on their own. So when you are all alone facing a big responsibility, what do you do?


Step-by-step guide to integrate Ranorex Test Automation with Test Collab

Ranorex is easy-to-use test automation software (though it is available for Windows only). A step-by-step wizard helps to set up the test environment and quickly get started.

For Windows application development and testing, it makes perfect sense.

Non-programmers can use the script-free drag & drop functionality, whereas professional programmers can use an API for C# and VB.NET to enhance their test suites and recordings.

Its powerful GUI recognition covers all requirements in terms of accuracy and unique identification: it will still recognize and find an element even if the button’s shape or color changes. Code and action modules can be reused across multiple test cases with click & go functionality, which saves a lot of time when changing multiple test cases. Recording tests is very simple: just press the record button, start your manual testing, and it remembers all of the steps. Redundant steps can be deleted with an easy-to-use editor.

Manually testing Feature Branches the right way

For many teams it is essential to work with different branches at the same time, so the main repository stays stable while development can still progress at a fast rate. Developers can create their own branches from the trunk/baseline and work independently on them. (Read more about that on Martin Fowler’s blog)

This creates a few problems when it comes to testing:

  1. With the changes that occur in a branch, some test cases are affected too and need to be changed.
  2. Multiple branches mean maintaining multiple copies of a test case.
  3. Testing an individual branch requires the tester to refer only to the updated list of test cases for that branch.
  4. Some branches will be merged to trunk/master sooner rather than later, which means that at some point you will be merging these updated test cases into trunk/master as well.


8 free new testing tools which are making development teams more productive

WOW, we’re seeing so much innovation in software testing this year.

A part of me can’t help but feel overwhelmed every time I see all these brand new awesome tools that promise so much. More tools bring me more pressure, but they also bring insane improvements in how things get done in a team. That’s why we never stop looking out for these new tools.

Every year, new tools do exponentially more than their predecessors.

Imagine this:

About 8 years back, it took me over 100 hours to set up a simple CruiseControl build system from scratch for our very first product.

Test Collab v1.12 released: Introducing Test Plans, Improved Requirements Management and coverage reports

We’re proud to announce our latest release, Test Collab v1.12, which introduces Test Plans, a built-in requirements management engine, new coverage reports (for the entire project, version-wise, and module-wise), and several enhancements and bug fixes.

Test Plans
You can now group multiple test executions under one entity: a test plan. Besides acting as a parent for multiple test executions, a test plan can also be used to mix and combine multiple configuration values to create multiple test executions. Consider this example:
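
The original example is not included in this excerpt. As a rough illustration only, mixing and combining configuration values amounts to taking every combination of the values, one test execution per combination. The sketch below does not use Test Collab’s API; the configuration names, the values, and the `expand_executions` helper are all hypothetical.

```python
from itertools import product

# Hypothetical configuration values attached to a test plan.
configurations = {
    "browser": ["Chrome", "Firefox"],
    "os": ["Windows", "Linux"],
}


def expand_executions(configs):
    """Return one dict per combination of configuration values."""
    keys = list(configs)
    value_lists = (configs[key] for key in keys)
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


executions = expand_executions(configurations)
for execution in executions:
    print(execution)
```

With two browsers and two operating systems this yields four test executions; each additional configuration multiplies the count, so it is worth being deliberate about which values really need to be combined.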

