About management systems and test automation framework

(These are my views. They have nothing to do with my current or past employers. The narration here does not reflect my experiences with, or the existing practices of, my current or past employers.)

Eventually every networking product evolves management software, either in the form of an element management system (EMS) or a network management system (NMS). Similar approaches are taken by the storage industry, the biomedical device industry and so on. Essentially, embedded software needs some management software.

It is also true that in this combination, the larger portion of the intellectual property lies within the embedded product. It is usually the first one to get developed and sold; later, for margin extension, the management software is developed.

If it were not for margin extension, most companies would line up with an OpenNMS kind of solution. A proprietary management system also enables vendor lock-in.

However, there also exists a keen, silent awareness that management software is a secondary product. Fewer resources are given to its engineering and testing. If a management software product were developed and tested fully, it could be a real USP – but that seldom happens.

On the other hand, the same step-motherly treatment is given to the core embedded product’s test automation framework. Actually, it is even worse to be in charge of a test automation framework. The work demands the maturity of an architect but is usually treated as even less important than testing or test automation.

If you take a step back and look, both pieces of software are doing the same thing – using application software to control the core embedded software.

A management system typically cares about FCAPS (Fault, Configuration, Accounting, Performance and Security). Many commercial systems also allow scheduled tasks.

A test automation framework does the same! It should have a scheduling mechanism, a configuration mechanism, fault monitoring to raise errors, accounting and performance-measurement modules for system testing and, of course, a security framework – to protect itself and also in the form of penetration testing.

That means a large overlap exists in the requirement sphere of the two. Ideally, common libraries and even common front-ends should be written for both. When the software runs in customer mode, it should act as a management system; when it runs in testing mode, it should act as a test automation framework.
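
As a rough illustration of the idea – a minimal Python sketch in which every class and method name is invented for this example – the same device-control core could sit behind both front-ends:

```python
# A rough sketch only: one shared core serving both a management front-end
# and a test automation front-end. All class and method names are invented.

class DeviceControl:
    """Shared core that talks to the embedded product (the FCAPS part)."""

    def __init__(self):
        self.config = {}

    def configure(self, params):   # Configuration
        self.config.update(params)

    def get_faults(self):          # Fault
        return []                  # placeholder: query the device here

    def get_counters(self):        # Accounting / Performance
        return {}                  # placeholder: query the device here


class ManagementFrontEnd:
    """Customer mode: exposes the shared core as an EMS/NMS."""

    def __init__(self, core):
        self.core = core

    def apply_policy(self, policy):
        self.core.configure(policy)

    def poll_alarms(self):
        return self.core.get_faults()


class TestAutomationFrontEnd:
    """Testing mode: exposes the same core to test scripts."""

    def __init__(self, core):
        self.core = core

    def setup(self, test_config):
        self.core.configure(test_config)

    def assert_no_faults(self):
        assert not self.core.get_faults(), "device raised faults during the test"
```

Which front-end is instantiated over the core then becomes a deployment decision – customer mode or testing mode.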

Not only are resources saved that way; a lot of testing effort is saved too. Say 60% of the management system’s code is common with the test automation framework; during test automation runs, that 60% overlap will be hammered like anything!

Then the question is, why is this not a practice? The answers are surprisingly not technical.

  1. As I mentioned earlier, management systems are usually implemented later than the embedded software. Often their development starts even later than the test automation framework’s – at best they are developed in parallel with it. Program management for absorbing both of them is hard.
  2. Also, as I mentioned, there is a 60% overlap in functionality – but there is also a gap of 40% on each side. That means you have to spend 140% of the effort to get both. Most of the time, senior management is (rightly) unwilling to risk so much at the same point in time – when one of them is a secondary software and the other isn’t even going to fetch any top line!
  3. Here is the dirty secret of the hi-tech industry (or the IT industry, as it is called in India). It has a caste system. Hardware engineering is held in the highest esteem, followed by embedded software, followed by application software, followed by test automation, followed by testing, followed by support. This hierarchy is reflected in the choice of tools – hardware is tested in C++, and applications may be written in C++; embedded software is written in C/C++ (and programmers often feel offended if they are asked to try a higher-level language) and is tested using either scripting languages or human languages (manpower). In this hierarchy, a development manager for a management system, a Java guy, usually finds it extremely offensive to agree that his/her problem space has so much in common with “something as mundane as a test automation framework”, which is “nothing but a bunch of scripts cobbled together”. Every hierarchy, every caste system, runs not on the feeling of inferiority (“I am sitting higher than y but I am sitting lower than x”) but on the feeling of superiority (“I am sitting lower than x but I am sitting higher than y”). The world needs everyone with fairly the same importance – but human pride makes such hierarchies sub-optimal yet rigid. Writing common software for the use of high-caste developers and low-caste testers usually becomes unthinkable, unspeakable or at least impractical.

As I mention in one of my jokes, sadly, technical problems are often the simplest.

Q: What is harder than to colonize Mars?

A: To get the budget approval for it!

Cisco’s “Application Centric Infrastructure” – is it “Decisionless”?

Cisco has been promoting “Application Centric Infrastructure” as an alternative to Software Defined Networking (SDN).

I need to do more homework to appreciate the difference between SDN and ACI.

However, what struck me was that ACI is about taking policy out of the forwarding path. As per my understanding of ACI, once a policy is set by the “intelligence” of the APIC, hardware takes over forwarding.

This is strikingly similar to the decision-less programming I have been advocating. Readers of this blog are aware that in decision-less programming, only I/O requires decisions. Business logic, buried deep in the FSMs of derived classes, is implemented as policy look-up tables.
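
A minimal sketch of what I mean (the states, events and actions below are invented for illustration, not taken from any real protocol): the state machine’s business logic lives in a lookup table, and the code merely indexes it – there is no if/else in the processing path.

```python
# Hypothetical sketch of a decision-less FSM: the transitions are data,
# not code. States, events and actions are invented for illustration.

# (state, event) -> (next_state, action)
TRANSITIONS = {
    ("idle",          "hello_received"):  ("neighbor_seen", "send_hello"),
    ("neighbor_seen", "hello_received"):  ("adjacent",      "send_update"),
    ("adjacent",      "update_received"): ("adjacent",      "install_route"),
    ("adjacent",      "timeout"):         ("idle",          "flush_routes"),
}

def step(state, event):
    # The only real decision happens at the I/O boundary (parsing the event);
    # the business logic itself is a pure table lookup.
    return TRANSITIONS[(state, event)]

state, action = step("idle", "hello_received")
print(state, action)   # neighbor_seen send_hello
```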

If my understanding of parallels so far is correct, I suppose ACI will have the same characteristics as decision-less programming:

  • There will be no “patchability” of policies. All the policies must be explicitly known, documented and implemented
  • The system will work predictably barring system level problems of running out of resources etc.
  • The system will be extremely fast
  • The system will be memory intensive
  • The system will require sophisticated input parsing
  • Testing of such systems will be trivial except at boundary conditions like array bound violations or at system level problems like thread lockups or disk overruns

Testing a feature in the presence of many possible combinations of environments

Take, for example, the testing of a router where we list features in columns by test area:

L1          L2      L3
10BaseT     STP     RIPv1
100BaseT    RSTP    RIPv2
1G          PVST    OSPF
10G         MVST    ISIS
                    BGP

In the beginning of a product like this, most of the interest will be in making sure each of the protocols works. Standalone testing of the cells will be more than enough.

Once most of the low-hanging (and highly damaging) bugs that can be found by testing the cells of the matrix alone are weeded out of the software, QA progresses by making its testers “experts” column-wise – an L1 expert, an L2 expert, an L3 expert and so on. This leads to more experienced testers and technically great bugs (like route redistribution having memory leaks). QA also arranges its test plans by test areas and features.

At this stage, only a section of the bugs seen by a customer has been eliminated, and the complaint “QA isn’t testing the way the customer uses the product” continues.

That is because the customer doesn’t use the product by columns. A typical customer environment is a vector selecting (at most) one member from each column. The product is likely to break at any such vector.

As you can see, exhaustive testing of the matrix above would require testing 4*4*5 = 80 environments. In reasonable products the actual number may be in the tens of thousands.
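
For the matrix above, enumerating the environments is a one-liner (a quick sketch, with the feature names taken from the table):

```python
from itertools import product

l1 = ["10BaseT", "100BaseT", "1G", "10G"]
l2 = ["STP", "RSTP", "PVST", "MVST"]
l3 = ["RIPv1", "RIPv2", "OSPF", "ISIS", "BGP"]

environments = list(product(l1, l2, l3))
print(len(environments))   # 80 = 4 * 4 * 5
```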

***

Testing a feature in the presence of many possible combinations of environments is a well-known QA problem.

Various approaches have been suggested. There are combinatorial engines, pairwise testing and so on to help QA optimize this multi-dimensional problem.

The approach I discuss here is yet another algorithm to be followed by semi-automation. Just let me know your thoughts about it.

***

Let us define our terms once again:

  • Feature: A feature in the product [I know that isn’t a very great definition.]
  • Area: A set of related features
  • Product: A set of Areas
  • Environment: A vector (or a sub-vector) of Product, with each component coming from an Area
  • Environment for a Feature: A maximal-length sub-vector of the Product excluding the Area of the Feature

Please understand once again that QA expertise and test plans are structured by Area (a column in the matrix). The best way to test would be to run every test of every Feature against every “Environment for the Feature”.

This approach is simply uneconomical. So, what is the next best approach?

***

Before coming to that, we need to understand how test cases are typically structured. Within a feature, test cases typically share steps – like configuration or authentication – or overlapping phenomena like the exchange of Hello packets, the establishment of neighborship, etc.

That means there is significant redundancy among tests from a white-box point of view.

This redundancy can be exploited to gain assurance that the product will hold up reasonably well in diverse environments. As we discussed earlier, such an environment is a vector of that matrix, which in turn is a sub-vector plus a test condition.

***

Understanding this much brings us to a workable solution for testing in a more “customer-like” way without incurring too much cost.

The key understanding from the above discussion is that Environment for a Feature can be changed independently of the test case (or the Feature or the Area).

That means that if the tester can signal an “environment controller” at the end of a test case, the controller can change the DUT/SUT to another Environment for the Feature. Once that change is done, the tester simply continues with the next test case – till all test cases are done.

Because it is unlikely that the number of test cases is a factor (or a multiple) of the number of sub-vectors, within a few test cycles a reasonable number of test steps will have been exercised across reasonably diverse environmental sub-vectors.
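
A minimal Python sketch of the mechanism (the sub-vectors, test names and the commented-out provisioning calls are placeholders for whatever the real framework does):

```python
from itertools import product, cycle

# Placeholder sub-vectors: environments for an L3 feature under test,
# i.e. every combination of the other columns (L1 x L2) from the matrix above.
sub_vectors = list(product(["10BaseT", "100BaseT", "1G", "10G"],
                           ["STP", "RSTP", "PVST", "MVST"]))

test_cases = ["ospf_hello", "ospf_adjacency", "ospf_lsa_flood"]  # placeholders

environment = cycle(sub_vectors)       # the "environment controller"

for test_cycle in range(3):            # a few test cycles
    for test in test_cases:
        env = next(environment)        # re-provision the DUT/SUT to this sub-vector
        # apply_environment(env); configure_feature(); run(test)   # placeholders
        print(test_cycle, test, env)
```

Since 3 test cases and 16 sub-vectors share no common factor, every pass through the test list starts at a different point in the environment rotation, so the same test keeps meeting new environments.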

As a product-testing strategy, QA can assure its internal customers that, over a few releases, most of the interesting bugs that can be found economically will be found.

***

What are the downsides of this approach? For one, the Environment for a Feature must not contain any configuration of the Feature under test – or even of its Area. That means the tester will always have to configure the Feature afresh before going to the next test. If you are working on a Feature that takes VERY long to converge, you are out of luck. [VPLS in networking comes to mind as an example.]

Since most products don’t involve long signaling delays, let me focus on the optimization of this testing.

How can we find the maximum number of bugs related to environmental changes in a given feature (or the entire column) in the minimum number of steps?

The answer is obvious to the readers of this blog – by taking the sub-vectors in an anti-sorted order!
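
If I read “anti-sorted” to mean ordering the sub-vectors so that consecutive ones differ in as many components as possible (my paraphrase of the idea, not a formal definition), a greedy sketch could look like this:

```python
# Hypothetical greedy "anti-sort": pick, at each step, the remaining
# sub-vector that differs most from the previous one.

def distance(a, b):
    """Number of components in which two sub-vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def anti_sort(vectors):
    remaining = list(vectors)
    ordered = [remaining.pop(0)]
    while remaining:
        best = max(remaining, key=lambda v: distance(v, ordered[-1]))
        remaining.remove(best)
        ordered.append(best)
    return ordered

subs = [("10BaseT", "STP"), ("10BaseT", "RSTP"),
        ("1G", "STP"), ("1G", "RSTP")]
print(anti_sort(subs))
# [('10BaseT', 'STP'), ('1G', 'RSTP'), ('10BaseT', 'RSTP'), ('1G', 'STP')]
```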

About delay, loss, fuzz and unfortunate events

I covered the importance of testing the impact of delayed responses earlier.

It is also important to test for lost responses to see how queuing up impacts the application/device under test.
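
A rough sketch of the kind of harness I have in mind (the send function, delay and loss figures are placeholders): wrap whatever channel talks to the device so that responses can be delayed or dropped under test control.

```python
import random
import time

# Hypothetical fault-injection wrapper around whatever function actually sends
# a request to the device under test and returns its response.

def with_faults(send, max_delay_s=2.0, loss_rate=0.1):
    def faulty_send(request):
        if random.random() < loss_rate:
            return None                             # simulate a lost response
        time.sleep(random.uniform(0, max_delay_s))  # simulate a delayed response
        return send(request)
    return faulty_send

# Usage: send = with_faults(real_send), then run the normal test suite and
# observe how the application copes with missing and late replies.
```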

Semantically confusing fields in networking data (like http://192.168.1.255 over a 255.255.255.0 LAN) should be tested, both to avoid gaffes and to move towards more secure code. Protocol fuzz testing should ideally cover such tests.

Lastly, there is also the possibility of events related to other protocols happening at unfortunate moments. How does your system behave when route summarization from OSPF into BGP is in progress AND an OSPF update arrives? Testing such a scenario is extremely difficult for want of the right set of tools and skills. It also unleashes a combinatorial nightmare for testing. However, with a careful “code baking” policy, it is possible to find strange, bad and nasty bugs before customers find them in the field.

The Urgent and the Important of Test Failures

Not all test failures are alike. Some should be addressed urgently and some are important to address.

Interestingly, these two sets are DISJOINT.

That is, if a test failure should be addressed urgently, it is seldom important. And those that must be fixed at any cost are seldom urgent (unless the team has really slacked).

How to determine which failure falls into which category?

Naturally, the easiest-to-fix failures must be fixed urgently, before the code moves on too much.

On the other hand, if a failure is long-standing, obviously a feature is broken and it is important to fix it.

It is as if the failure signal has to pass through filters: the output of a high-pass filter is the urgent, and the output of a low-pass filter is the important.
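
In other words, classify a failure by how long it has been failing. A toy sketch (the threshold and the record of consecutive failures are invented for illustration):

```python
# Toy classification of test failures by how long they have been failing,
# mirroring the high-pass / low-pass filter analogy. The threshold is arbitrary.

URGENT_WINDOW = 3   # consecutive failing runs

def classify(consecutive_failures):
    if consecutive_failures <= URGENT_WINDOW:
        return "urgent"       # fresh breakage: fix before the code moves on
    return "important"        # long-standing: a feature is genuinely broken

print(classify(1))    # urgent
print(classify(10))   # important
```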

A bug’s Afterlife

The question is similar to the one Nachiketa asked Yama, the god of death: “What happens to the soul after death?”

The answer is: “Nobody cares. It just sits there till someone digs out a bug to prove a point against someone.”

And that is why Quality doesn’t improve.

Even the famous state transition diagram of Bugzilla doesn’t address one crucial point – there IS an afterlife of a bug! As a matter of fact, a bug may have many afterlives.

  1. Afterlife #1: A bug filed by a support/escalation engineer, found at a customer site
    1. The purpose of such a fix is to work at the customer site, right? So conscientious QA keeps the bug in the “Resolved” or “Verified” state and marks it “To be verified at the customer site” or with a similar phrase. It is the duty of the support engineer dealing with the customer to actually verify and close it
    2. Fixing a bug reported by a customer is often not sufficient. It is also advisable to go out and declare that the defect has been addressed. This means the Documentation team has to take notice of such a bug and include it in the next version’s documentation. So a clone of this bug must be assigned to the Documentation team
    3. Such a clone needs to be verified too, right? Typically QA verifies such a documentation bug and closes it
  2. Afterlife #2: A “good” bug found by anyone
    1. Depending on the nature of the bug, the scenario may turn into a valid test case. So such a bug must be cloned and assigned to a QA engineer to convert into at least one test case
    2. Such a test case needs to be verified and closed by QA too
  3. Afterlife #3: A “good” bug that has turned into a test case
    1. A valid test case is likely to get converted into a test script. So the bug may also be cloned for assignment to a test automation engineer
    2. Such automation needs to be verified by QA too

Here, cloning isn’t the only option, but because a bug may split multiple ways – customer support, QA, automation and documentation – it is best to clone the bug and let each department worry about its further life cycle.

This is one of the reasons I disagree with ignorant non-QA (and also ignorant QA) folks who claim: “QA’s job is to find bugs”.

What do you say?

Sources of inspiration for test cases

AFAIK, software engineering (SWE) maintains that tests are derived from (Marketing/System/Software) Requirement Specifications – and that is the end of it.

If it were so, I am sure purists of Software Engineering would agree with my observation.

I have derived meaningful test cases from at least these sources:

  1. (Marketing/System/Software) Requirement Specifications
  2. Bugs reported from the field (escalations)
  3. Questions asked in mail groups for (System Engineers/Subject Matter Experts/Beta customers)
  4. Knowledge Base articles written for Customer Support Engineers
  5. Test Automation Scripts
  6. White papers from marketing – and from competitors’ marketing

Where else have you got inspiration?