Wednesday, September 14, 2011

2 Rules for Investigating Production Defects

Production defect investigations can be golden opportunities for testers. Yet I have observed a tendency for these investigations to be less than productive, swinging quickly from blaming the test team to excusing them, and, in turn, excusing practically everyone else as well. I appreciate the underlying empathy, but making excuses is habit forming, and it distracts us from the tasks at hand - learning and improving.

I hate wasting these opportunities, and I don’t want to waste our time. So I have adopted two rules, intended to short-circuit the unproductive blaming and excusing exercise, allowing us to move into the learning...

Rule 1: All production defects could have been caught by a test.


Sometimes the best way to end finger-pointing is for someone to take (at least part of) the (initial) blame. For best effect, someone in a test-leadership capacity should state this rule loudly as soon as the investigation starts. This statement may be met with stunned silence or instant agreement. Either way, we have just opened up some time and space to imagine a better test, a test that would find the problem, and possibly to generalize and explore for tests that may find adjacent problems along multiple dimensions. Of course, these tests may not be practical or economically feasible, but I frequently find that they are both very practical and very affordable. Too many investigations fail to challenge us to find these new tests. These new tests can take the quality of our testing, and the quality of future releases, to a “whole ‘nother level.” If we can delay the investigative team’s empathic response, and if, as test leaders, we can broaden our shoulders a bit, we can take advantage of this opportunity almost every time.

Now that we have done that, it’s time to remind our colleagues that the missing tests were merely a superficial aspect of the underlying problem...

Rule 2: All production defects are caused by problems with the requirements, design, code, deployment, operation and/or usage of the system.


In other words, the testing problem is not THE problem. To sing an old refrain, testers neither cause nor fix defects. While Rule 1 gave us a pause in which to learn the testing lesson, Rule 2 reminds the investigators that there’s something more important that needs addressing. These investigations have multiple responsibilities. (Perhaps the term ‘root cause’ has reinforced a bad model in our thinking.) Maybe your stakeholders are satisfied with simply being able to detect the problem should it happen again, but I don’t know many who like leaving money on the table.

Challenge yourself to find a better test, then challenge your teammates to prevent it from finding another problem. The next time you are involved in a production defect investigation, give these 2 rules a try, and let me know how it goes (gmcnelly@gmail.com).


Wednesday, August 24, 2011

CAST 2011

Some notes from CAST2011...

3 things that made me smile...

1. Jon Bach’s bug reports – Jonathan Bach, the conference chair, kicked off both days with a brief address to the attendees that included a bug report – things that had not gone quite right with the conference, and what was being done about them. How apropos for a testing conference! I imagine chairing such an event is an arduous undertaking. It’s nice to see someone pull it off with such grace and humor.

2. Progressive shout-out – James Bach, in his day 2 keynote, mentioned Progressive as an example of a large corporate shop that embraces context-driven testing as a professional practice. It was a simple, but profound, moment of gratification for me and my colleagues who have worked to make this so. It was also a great set-up for my talk, but I’ve had to swear to at least three different people that it was purely coincidental. James and I did not arrange the timing (I swear, really).

3. My talk: “Developing a Professional Testing Culture” – I was honored that James Bach selected me to present a talk on this subject. After the past few years of struggles, lessons, and triumphs in this arena, for me, this was a story that practically wrote itself. And, for once, I think I did not rush through my talk. The audience was great. I think there were close to a dozen green cards up before the facilitator uttered the words “open season”. I was left with the impression that many others are experiencing similar struggles in the cultural arena. A couple of themes that developed in open season were the importance of building on small victories and the importance of organizational support. Many stayed in the room past the appointed end time, and we talked and exchanged business cards well into the lunch hour. I owe a special “thanks” to Rob Sabourin for his feedback immediately afterward that will help me be a better presenter in the future (not quite “just in time”).

5 things that will have me thinking...

1. Bolton on repeatability (during his day 1 keynote) – To paraphrase Michael Bolton out of context, “Seeking repeatability hurts many testing efforts… Testing for adaptability increases coverage…” We tend to harp on repeatability; I still think it is frequently an important problem, but taking a step back is probably healthy.

2. Positive deviance – also from Bolton’s keynote. In many ways, this may be naming the thing that we have seen succeed in our testing culture. Perhaps by naming it, we can begin to better harness it... http://en.wikipedia.org/wiki/Positive_Deviance

3. “Tacit and Explicit Knowledge”, by Harry Collins – a book mentioned prominently in James Bach’s keynote.
http://www.amazon.com/Tacit-Explicit-Knowledge-Harry-Collins/dp/0226113809

4. Intersubjectivity revolution – also from Bach’s keynote.
http://en.wikipedia.org/wiki/Intersubjectivity

5. Experimental design, pairwise testing, etc. – In his session, Justin Hunter lamented the seeming lack of interest in experimental design among our community. I think I sense this too, and it has me scratching my head as well.
http://hexawise.com/combinatorial-and-pairwise-testing

That’s it. All in all, another worthwhile event by AST. See you all next year in the Bay Area.



Monday, November 29, 2010

Grey Swan Defocusing

My continued musings on The Black Swan by Nassim Taleb, but this time with my software tester hat on...

Not all software bugs are created equal. A small, hopefully very small, number of them have extreme, possibly even catastrophic, impact on a system’s stakeholders. We don’t see them coming, yet later claim that all the signs were there. These are Black Swan bugs.

By definition, the Black Swan bug escapes to the field. My only defense is preparation for the consequences. However, to abuse the metaphor, not all non-white swans are truly black. I may be able to avoid unnecessary damage to my customer, and to my reputation, by dedicating a small portion of my testing to hunting grey swans. When it’s time to take a step back from my testing, perhaps I can translate Taleb’s contributing factors into defocusing heuristics...

My Grey Swan Defocusing Checklist

Confirmation Bias - Are we confirming a specification, claim, example or observation without questioning or exploring it?
  • Try contradictory patterns of test data.
  • Try drastically different patterns of test data.
Narrative Fallacy - Are we buying a story because it’s convenient?
  • Skip and reorder steps in common sequences.
  • Attempt to break relationships.
Ludic Fallacy - Are we assuming that everyone plays by the rules?
  • How might internal entities circumvent the rules? (business rules, operational rules, project rules)
  • What forces might be exerted on the system by external entities?
  • Are we trusting a tool or environment that’s hiding something from us?
Beware the Scalable - Are we underestimating bug severity?
  • Repeat errors or situations that are expensive to process.
  • Look for ‘Perfect Storms’ - combinations or sequences of events that cause severe problems.
  • Exhaust resources - what happens when that log fills up?
Silent Evidence - Are we ignoring, or even discarding, pertinent information?
  • Look at losers - discarded test failures, discarded test cases, especially controversial ones.
  • Have we glossed over something that might cause a problem?
  • What are we not observing?
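To make the first of these heuristics a bit more concrete, here is a minimal sketch in Java of what probing for confirmation bias might look like: rather than testing only the data patterns the specification illustrates, deliberately feed the system values that contradict them. The field being probed and the commented-out service call are entirely hypothetical.

    import java.util.List;

    // Hypothetical probe values for the "Confirmation Bias" heuristic: instead of
    // the friendly examples in the spec, try values that contradict them.
    public class GreySwanProbes {

        // Assumed target: a customer-name field documented as "letters only".
        static final List<String> CONTRADICTORY_NAMES = List.of(
                "",                      // empty, not just "a valid name"
                "   ",                   // whitespace only
                "O'Brien-Smith, Jr.",    // punctuation the spec's examples omit
                "名前",                   // non-Latin script
                "a".repeat(10_000));     // drastically longer than any example

        public static void main(String[] args) {
            for (String name : CONTRADICTORY_NAMES) {
                // A real harness would call the system under test here and log
                // anything surprising for follow-up, e.g.:
                // Response r = client.createCustomer(name);  // hypothetical API
                System.out.printf("probe length=%d%n", name.length());
            }
        }
    }

The same shape works for the other heuristics: swap in probes that skip steps, break relationships, or exhaust a resource.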


Monday, November 8, 2010

WOPR15

I recently attended WOPR15 (WOPR = Workshop on Performance and Reliability, http://performance-workshop.org). The theme of this workshop was “Building Performance Test Scenarios with a Context-Driven Approach.” It’s amazing what you learn when a couple dozen or so bright people get together to share experiences and vigorously discuss and debate ideas. This blog will fall woefully short of portraying the experience effectively, much like looking at a picture of the Grand Canyon, but here goes...

I have to applaud the content owner, Michael Bonar, and the WOPR organizers on selecting a line-up of presenters that came at this challenging theme from a variety of creative, sometimes non-intuitive, angles. This made it even more interesting to see the emergence of patterns, or sub-themes, across these disparate stories.

From my perspective, the most compelling of these sub-themes was “Testing in Production.” Companies like Facebook, eBay, Google and Microsoft support operations of such massive scale that it is not feasible, or cost-effective, to maintain production-scale test environments. Instead, they invest in processes and tools that allow them to use their production systems for testing.

Another prevalent sub-theme was “Just get started.” There seems to be a consensus among these leading practitioners that waiting for all questions to be answered, or all forms to be filled out, is a trap (frequently self-imposed) that handcuffs testers and puts them under extreme pressure later. Start exercising the system as soon as you can. Maximize your ability to plan, design, test and report concurrently.

Yet another significant theme touched upon by several of the presenters was “Have a plan, but adjust aggressively.” Frequently something happens in the midst of a testing effort that will lead you away from the plan if you pursue it. Good! That’s why we test! We are finding potential problems – those problems don’t always respect your plans.

Some other nuggets:
  • Luck and intuition may play a role in the success of your testing effort, but an exploratory approach, executed by skilled testers and test managers, positions you to harvest that luck and intuition. (My take on Jon Bach’s experience report (ER)) 
  • Great testing can become a selling point for your product. (from Paul Holland’s ER) 
  • What could you do if you were not afraid? (A brave attitude toward development and deployment - from Goranka Bjedov’s ER) 
  • Focus on wait times and queue lengths when looking for problems, not so much on CPU and memory utilization. (also from Goranka’s ER)
Like I said before, I cannot do this workshop justice in a short blog entry. There were several other presenters and many more interesting points of discussion. I continue to derive tremendous value from this format. This is focused experience sharing, discussion and debate of our practice at its finest. The next WOPR, which is WOPR16, will be hosted by my employer, Progressive, and I am the content owner. The theme is “The Intersection of Performance and Functional Testing.” A call for proposals will go out in January. The conference dates are April 28 – 30, 2011.


Tuesday, October 5, 2010

Acceptance Test Driven Development: A Testimonial

The following is a testimonial I submitted to Ken Pugh recently - Ken is working on a book about Acceptance Test Driven Development (ATDD).  Though it is far from a cure-all, in my current context, I have come to regard ATDD as a foundational development and testing practice...

In 2006, I started an assignment with a group that published service APIs to other parts of our company for the purpose of retrieving data from external vendors. At the time, they were testing most of their services manually, using the GUIs of the calling applications. Their testing was dependent upon the availability of both their own and their clients’ test environments, and disruptions were common. Additionally, their service request and response schemata consisted of thousands of fields, but only dozens of those fields were exposed directly in the client GUIs. The rest were calculated, defaulted or simply ignored. Not surprisingly, the test team felt squeezed by schedule pressure and quality problems.

Their management decided that they were simply outgunned by the technical challenge before them, so they recruited some programmers with testing skills (and vice versa) to join the team. That’s where I and a couple of other test engineers came onto the scene. One of the first things we did was raise awareness about the low level of coverage for these relatively complex interfaces. (This was somewhat disconcerting for veteran team members, and it is a tribute to the maturity of all involved that these conversations were rarely contentious.) We also began surveying other test teams within our company to see if there were any tools already in-house that we could use to circumvent the client GUIs and go directly at our service interfaces. We discovered a group that was using Fit for a similar purpose and it was love at first sight.

We copied their implementation (Fit and OpenWiki with some customizations) to our environment and within days we were creating and executing tests for some of our larger projects. Within a few weeks we had these tools well integrated into our infrastructure and processes. Tests were now being defined during or shortly after requirements definition, frequently serving to clarify requirements, but we didn’t know then to call it ATDD. Soon developers were asking for our tests to run before check-in, and were helping with fixture design and development.
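For readers who have not seen Fit, here is a minimal sketch of the kind of column fixture we were writing. The class, field, and method names below are hypothetical stand-ins, not our actual schema; our real fixtures delegated to the service APIs under test.

    import fit.ColumnFixture;

    // Hypothetical fixture for a vendor-data service. In a Fit column fixture,
    // public fields bind to input columns and public methods (written with "()"
    // in the wiki table header) bind to expected-value columns.
    public class QuoteEligibilityFixture extends ColumnFixture {
        public String state;        // input column: customer state code
        public int priorClaims;     // input column: number of prior claims

        public boolean eligible() { // expected-value column
            // Our real fixtures called the service under test here, e.g.:
            // return quoteService.isEligible(state, priorClaims);  // hypothetical call
            return priorClaims < 3 && !"XX".equals(state);          // stand-in logic
        }
    }

The wiki table for a fixture like this is just the fixture name in the first row, a header row of column names, and one row per test case, which is a big part of why business analysts could read, and help write, the tests.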

The number of tests for our systems typically increased five-fold as we introduced our implementation of automated ATDD, and we moved from executing a handful of test passes per project to a handful of passes per day. Defects discovered in our QA environment dropped dramatically because we were running the tests in predecessor environments - the tests became informal entry criteria. Test projects were costing about the same, and taking about the same amount of time, but quality was increasing significantly. In fact, the team’s quality ranking within the company, based on production availability of our systems, improved from “worst to first” in about a two-year period.

In addition to the quality improvements, we gained a great deal of confidence in our ability to refactor our systems and move them through environments because coverage had increased substantially and test execution had become relatively effortless. Furthermore, the clarity, usability and credibility of the tests led to more collaborative test failure investigations. It was not uncommon to see developers, testers and business analysts huddled around a screen, or camped in a conference room, discussing the significance of patterns of red cells on a test result table - discovering and resolving issues in minutes where formerly it had taken hours or days of asynchronous communication. While there are many other ways that we have continued to improve our testing, nothing has been as “game changing” as our move to automated ATDD with Fit.


Tuesday, September 21, 2010

The Black Swan


One of my favorite things about working with testers is that they read a wide variety of interesting books - this is one of them. “The Black Swan,” by Nassim Taleb, has generated considerable buzz within several communities of thought, including testing. I found that I could not resist reading this book as an investor on first pass, so I may need to re-read it as a software tester (perhaps another blog post to follow). Here is my summary of, and reaction to, The Black Swan...

A Black Swan event has three characteristics. It is an outlier, it has extreme impact, and it is later thought to have been predictable or even predicted. "...Rarity, extreme impact, and retrospective (though not prospective) predictability." September 11th, several market crashes, and WWI are examples of large scale Black Swan events. Black Swans may also be smaller in scale, or even personal, such as the beginning or ending of a romantic relationship; and they can also have positive impacts, such as an unexpected inheritance. At the time of this writing, I am 43 years old, and I would agree there have been several Black Swan events during my lifetime, though, unfortunately, no long lost rich uncles.

Taleb’s point is not just that Black Swans are real; it’s that they actually drive the course of history (and the courses of our lives), much more so than ‘normal’ events. Furthermore, many of our current methods of forecasting the future and managing risk are not only ineffective, they actually incubate Black Swans as they exacerbate our exposure to them.

Really? How can this be? Is this guy just saying something sensational to sell books?

Taleb discusses at length institutionalized misunderstandings of the nature of uncertainty, decrying the Gaussian bell curve as the “Great Intellectual Fraud.” One of the problems with the bell curve is the nature of outliers. The bell curve suggests that their rarity practically eliminates the significance of their effect, allowing us to predict with false confidence; while Taleb holds that outliers in some important fields, like finance and history, profoundly affect the nature of all subsequent events.

Another problem with the bell curve is that of regress. In other words, we need more data to better define the shape of the curve, but we assume the shape of the curve before we plot the data. Try to explain this to someone who is not a statistician; they will be asleep before the second wave of your hand.

Silent evidence is a more general problem with our modeling tools. The data that we observe most easily is likely to be produced by winners or survivors of some process. The losers are usually harder to see, but they may be much more numerous, giving us an overly optimistic view.

Taleb does not eschew all mathematical tools, however. He compliments the concept of scalable randomness and the related work of Mandelbrot in the field of fractals. He claims that markets, for example, are better modeled as fractals because of the model’s ability to ‘blow up,’ but that their exponential factors are, alas, still not knowable with any useful level of precision to allow prediction.

“...scalable randomness is unusually counter-intuitive.”

“There is no such thing as a “long run” in practice; what matters is what happens before the long run.”

Taleb also discusses some psychological factors that expose us to Black Swans, for example, confirmation bias. This is our tendency to accept confirmation of our beliefs and ignore contradicting evidence. Narrative fallacy is another such psychological factor. This is our tendency to organize data into stories, to imagine causal links between events, to ‘fill in the blanks.’ This makes it easier for us to remember more information, but the imagined links may be phony.

Information itself is not as valuable as we might think. Given the aforementioned problems with our understanding of uncertainty, and our psychological tendencies, the addition of more information to our situation may only serve to solidify our grip on dangerously flawed models.

Let’s just pretend for a second that I buy all this: what does it mean to me? How should it affect my behavior?

Taleb does offer a piece of semi-concrete investment advice, and that is to use a “Barbell Strategy.” Put ninety percent of your money in very stable investments, and the remaining small fraction in highly speculative vehicles. You gain exposure to positive Black Swans without risking the substantial impact of negative ones. (As of this writing, I am considering, but do not feel compelled by, this suggestion.)

Otherwise, despite Taleb’s appreciation of the pragmatic, and distaste for the theoretical and academic, practical advice was admittedly sparse in this book. However, I think it is safe to say if you try to change your predictive models to account for Black Swans, you’ve missed the point. I imagine Taleb is simply telling us, "DON’T BE THE TURKEY." - STOP PREDICTING. Or, if you must predict, please be aware that you are likely to do so horribly. STOP RELYING ON THE PREDICTIONS OF OTHERS. Or, if you must do so, please protect yourself from the fallout. And finally, BE ROBUST AGAINST DISASTER, AND OPEN TO OPPORTUNITY (whatever that means to you).


Monday, September 13, 2010

CAST 2010

Now that I've started the blog, I'm going to reach back into the events of recent history for a few posts. One of those events was CAST 2010, the Conference of the Association for Software Testing.


Overview

CAST is a highly collaborative conference of testing professionals, smaller than the STAR conferences, but densely populated with smart, passionate, vocal people. Among those I met for the first time were Cem Kaner, Harry Robinson, Doug Hoffman, Scott Barber, Matt Heusser, Becky Fiedler, Ben Simo, Tim Coulter, Selena Delesie, Michael Hunter, Cristina Lalley, and Joe Harter. It was also great to renew connections with Michael Bolton, Rob Sabourin, Eric Proegler, Paul Holland, and Michael Bonnar. I know I am leaving somebody out - please feel free to yell at me.

(For you quants, that was 17 people. Conference attendance was about 105. That means that I personally networked with at least 16 percent of the conference attendees. Though the number would be much higher if I had recorded observations more carefully, this is still not bad for an extreme introvert - and a tribute to the nature and quality of this conference.)

CAST is conducted by prominent practitioners, not a corporation or group of vendors. (There are vendors sponsoring the conference, but they do not seem to be the focal point as they are at some other conferences. Actually, I felt sorry for them, at times, due to the lack of attention they received at their booths.) Many of the attendees pay their own way. As such, the level of 'engagement' is very high. The material presented is based on real-world testing. The viewpoints discussed are based on real-world experience. This is not a 'vacation' conference. I came away both energized by new ideas and exhausted from the constant mental stimulation.

My favorite take-away was a set of techniques for large scale testing that I believe will be directly applicable to improving testing in my current context. These techniques were outlined mostly in Harry Robinson's 'Exploratory Test Automation' tutorial and the session on 'Testing Large Scale Scientific Computations: The Short Circuit Method' given by Gaston Gonnet and Monica Wodzislawski, and they were built upon brilliantly in subsequent discussions with several other attendees.


My Presentation – Testability and Technical Skill

Overall, I think my presentation went okay. I rushed a bit, and people were a bit tired since it was the afternoon of the last day. I still have significant room for improvement with my presentation skills, but I walk away from this encouraged to continue improving.

Interestingly, this did not seem to be a controversial topic at all for the audience at this conference. They seem to accept and assume that testers benefit from technical skill. I was hoping to stir up at least a little challenge, but nada. I wonder why it is so controversial in my shop? Is this a localized phenomenon? Does it correlate to the aforementioned level of energy and commitment among the attendees?


Some Highlights
(There were many more, possibly excellent, sessions that I did not attend; these are just some notable points from the sessions I did attend.)

Exploratory Test Automation – Harry Robinson
  • Shared some creative ideas for generating large-scale random inputs for systems.
  • Described two specific approaches (a toy sketch of the first appears after this list)
    • Production grammar
    • State modeling
  • More creative ideas for creating lightweight dynamic test oracles
  • Put your machines to work while you are away from the office.
  • You can have crisp handoffs or quality code, but probably not both.
  • We need testers who can design.
  • I was able to spend a significant amount of time talking to Harry after the tutorial and he helped brainstorm ideas about how we can use these techniques to test our product.
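To illustrate what a production grammar can look like (this is my own toy sketch, not Harry’s code), here is a depth-limited generator that expands a tiny arithmetic-expression grammar into random inputs; a real harness would feed each generated input to the system under test and check the result against a cheap reference oracle.

    import java.util.Random;

    // A toy "production grammar" input generator in the spirit of exploratory
    // test automation (my sketch). It expands a tiny expression grammar into
    // random inputs for some hypothetical calculator or rules engine under test.
    public class ExpressionGrammar {
        private static final Random RNG = new Random();

        // expr := term | expr op term   (depth-limited so inputs stay finite)
        static String expr(int depth) {
            if (depth <= 0 || RNG.nextInt(3) == 0) return term(depth);
            return expr(depth - 1) + op() + term(depth - 1);
        }

        // term := number | "(" expr ")"
        static String term(int depth) {
            if (depth <= 0 || RNG.nextBoolean()) return Integer.toString(RNG.nextInt(1000));
            return "(" + expr(depth - 1) + ")";
        }

        static String op() {
            return new String[] {" + ", " - ", " * ", " / "}[RNG.nextInt(4)];
        }

        public static void main(String[] args) {
            // Generate a batch of inputs; a real harness would submit each one
            // to the system under test and compare against a reference oracle.
            for (int i = 0; i < 10; i++) {
                System.out.println(expr(4));
            }
        }
    }

The point is that a few lines of grammar can generate enormous volumes of inputs overnight, which is exactly the “put your machines to work while you are away” idea.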
Keynote on Estimating – Tim Lister
  • Covered some common issues with estimating
  • Presented a method for measuring estimates – EQF (Estimating Quality Factor).
  • Interesting analogy between estimating and hurricane forecasting.
    • I spent some time afterward discussing this analogy with Tim. The hurricane does not provide estimates and the forecasters don’t live in the hurricane. Does this mean we should try to have external parties estimate our projects? Hmmm.
Technical vs. Non-technical Skills in Test Automation – Dorothy Graham

  • Covered some of the basics of test automation skills.
  • Interesting discussion around whether tool independence is a worthy goal. It depends, of course. This is likely to be an issue we are discussing in my shop in the near future.
  • Others generally agreed with my observations that they have seen programmers learn how to test effectively more frequently than they have seen testers learn how to program automation effectively.

Investment Modeling as Exemplar for Exploratory Test Automation – Cem Kaner
  • As an avid amateur investor, I found this talk interesting, but I never clearly made a connection to exploratory test automation. There was a lot of material in the slides; I need to review it again.
  • A controversial point: “GUI level regression testing is thought to be one of the industry’s worst practices.”
  • There was an interesting point raised by an audience member – we testers need to go to the conferences that our customers are going to, not just constantly talk amongst ourselves.

Testing Large Scale Scientific Computations: The Short Circuit Method – Gaston Gonnet and Monica Wodzislawski
  • This presentation was on a higher technical plane than any of the other talks I attended.
  • How to test complex long running programs with simple inputs and outputs, e.g. weather modeling programs.
  • Testability suggests where faults can hide from testing, and testability does not need an oracle. (That's deep, man.)
  • They enumerated four techniques for creating dynamic oracles.
  • This was a fantastic complement to Harry Robinson’s talk.
So that's a quick tour of CAST 2010 from my perspective. I thought it was a very positive experience, and will most likely try to attend CAST 2011, which will be chaired by Jonathan Bach in Seattle, Washington, sometime in July. Maybe I will see you there.
