An Exercise in Analysing Estimates

I’m currently coaching an organisation undergoing a Scrum implementation and we’re having some problems with velocity within teams and the consistency of estimates across teams. The teams are delivering well, but velocity is a little choppy and the teams are feeling a little blind as to their true rate of progress.

As an exercise we looked at the stories the teams sized up and took on across a few sprints, and we looked at their estimated effort for each story once they did the task breakdowns and had built their sprint backlog.

The Data

Here’s what we saw after a few sprints:

Team 1

Points	Avg Estimated Time (hours)	Standard Deviation (hours)
3	2.5	1.32
5	2.67	1.89
8	7.67	3.25
10	9.75	1.77
13	9.67	7.23
20	19.86	9.33
30	22	0

Team 2

Points	Avg Estimated Time (hours)	Standard Deviation (hours)
3	2	0.71
5	8	0
8	9.5	4.69
13	23.5	16.99
20	44.67	9.29

Before you measure, understand the goal

The temptation when looking at any numbers here is to over-analyse things, fall into the trap of “we need better estimates”, and forget about the true aim of any Scrum Team, which is to creatively and productively deliver business value to your customers.

The primary goal is always delivery! Not better estimates. The only reason we estimate in the first place is to help our product owners forecast when items are likely to be completed.

With that said, if we have bad estimates we will have bad forecasts and unhappy product owners and customers. We would like to have reasonable estimates so that we have reasonable forecasts, but not if it means we spend so much time estimating that we forget to get out there and build some awesome stuff that makes people happy.

The purpose of this exercise is to improve our understanding of the estimates we are providing. But remember, better estimates are a secondary objective and inconsequential compared to the prime objective of delivery.

When scaling, should teams have a consistent baseline?

Both teams are working on the same product, though in different areas of the product to avoid stepping on each other’s toes. Initially each team has sized the work they are doing using individual product backlogs.

Do they need to have some level of consistency between teams? Maybe – maybe not. In my customer’s case they would like that consistency because they’re having trouble knowing how long things will take and are having trouble forecasting using the velocity numbers from each team. While at the moment the teams are fairly separate, this may not always be the case and if the teams end up working on the same area of the product it would be nice to know that if Team 1 sizes something at 5 points that Team 2 would size the item at 5 points as well.

If the teams stayed distinct all the way through development, this consistency wouldn’t be required.

Cross team sizing comparison

Look at the estimated effort for a 13 point story in Team 1. It’s about 10 hours. The same 10 hours in Team 2 is an 8 point story.
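
The comparison above can be reproduced from the two tables. The following is a minimal sketch, not part of the original exercise: the dictionaries copy the average hours from the tables, and the helper names closest_size and shared_size_ratios are hypothetical.

```python
# Average estimated hours per story size, copied from the tables above.
TEAM_1 = {3: 2.5, 5: 2.67, 8: 7.67, 10: 9.75, 13: 9.67, 20: 19.86, 30: 22.0}
TEAM_2 = {3: 2.0, 5: 8.0, 8: 9.5, 13: 23.5, 20: 44.67}


def closest_size(hours, team):
    """Return the point size whose average estimate is nearest to `hours`."""
    return min(team, key=lambda points: abs(team[points] - hours))


def shared_size_ratios(team_a, team_b):
    """For each size both teams used, return team_b hours / team_a hours."""
    shared = sorted(team_a.keys() & team_b.keys())
    return {points: round(team_b[points] / team_a[points], 2) for points in shared}


# Which Team 2 size matches the hours Team 1 gives a 13 point story?
equivalent = closest_size(TEAM_1[13], TEAM_2)
ratios = shared_size_ratios(TEAM_1, TEAM_2)

print(f"Team 1's 13 point story is closest to a Team 2 {equivalent} point story")
for points, ratio in ratios.items():
    print(f"{points:>2} points: Team 2 estimates {ratio}x the hours of Team 1")
```

The ratio is not constant across sizes, which is one reason a single conversion factor between the two teams' points would be unreliable.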

Why the difference? Is it just because Team 1 is much faster than Team 2? Do they just have a higher velocity and are more awesome than the other team?

Is it because Team 2 is working on items that are harder than they estimated when they sized them?

Honestly, the numbers can’t tell you. You would have to look beyond the numbers to see what’s going on.

In the case of my customer the two teams are roughly equivalent. Same team size, roughly the same domain knowledge and skill level. As such I would expect that both teams estimating the same sized items would come out with approximately the same number of hours for the effort involved.

When that’s added to the understanding of the numbers I’m inclined to think that Team 1 is simply estimating using higher numbers than Team 2. This is not uncommon for teams starting with Scrum and learning to do sizing for themselves. As long as they stay consistent, their team velocities will cancel out any “padding” of the story points they have done.

Relative sizing is “Relative”

Now we come to the more interesting thing we can consider in the estimate statistics and the one I’m much more interested in raising the awareness of within the teams.

Firstly, no team sized any 1 or 2 point stories. This is a smell straight away for me and makes me think the team are padding their sizes. After talking to the team, I know this to be the case and it’s something they’re having to unlearn.

Next, if we consider relative sizing then the difference between a 5 point story and a 20 point story should be about 4 times.

In Team 1, a 5 point story is 2.67 hours, so a 20 point story should be around 11 hours. Instead we see that 10 hours works out to be around the 13 point size and the 20 point story is about 20 hours, roughly 7.5 times the 5 point items.

Maybe it’s just that Team 1 didn’t use 5 as their “average” size story, but rather 8 points. Let’s see. An 8 point story is almost 8 hours. OK. So a 20 point story would be about 20 hours – not bad. However the 13 point story doesn’t fit, nor does the 5 point story.
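
The "what if the baseline were 5, or 8?" check can be sketched in a few lines. This assumes strictly linear scaling (hours proportional to points), which is the simplification the reasoning above uses; the function name linear_expectation is hypothetical.

```python
# Team 1 average estimated hours per story size, copied from the table above.
TEAM_1 = {3: 2.5, 5: 2.67, 8: 7.67, 10: 9.75, 13: 9.67, 20: 19.86, 30: 22.0}


def linear_expectation(team, baseline):
    """Scale the baseline size's hours linearly to every size the team used."""
    hours_per_point = team[baseline] / baseline
    return {points: round(points * hours_per_point, 2) for points in team}


for baseline in (5, 8):
    expected = linear_expectation(TEAM_1, baseline)
    print(f"Baseline: {baseline} point story")
    for points, actual in TEAM_1.items():
        print(f"  {points:>2} pts  expected {expected[points]:6.2f}h  actual {actual:6.2f}h")
```

With the 8 point baseline the 20 point row lines up closely, while the 5 and 13 point rows come out well under their expected hours.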

Only 3 estimate brackets?

In fact, looking at the average estimates it would appear that the team can only estimate in Small, Medium and Large timeframes, where Small is about 3 hours or less (half a day), Medium is a day (8-10 hours) and Large is 20 hours (2-3 days). Again, this is not uncommon for teams starting out and something I’ll need to work through with the teams to help improve their understanding of what they are doing so that they can inspect & adapt.
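
To see how Team 1's seven point sizes collapse into three brackets, here is a small sketch. The cut-off values (4 and 12 hours) are assumptions chosen to sit between the rough bands described above; they are not figures from the teams.

```python
# Team 1 average estimated hours per story size, copied from the table above.
TEAM_1 = {3: 2.5, 5: 2.67, 8: 7.67, 10: 9.75, 13: 9.67, 20: 19.86, 30: 22.0}

# Hypothetical cut-offs in hours; only rough bands are known, so these
# thresholds are assumptions for illustration.
SMALL_MAX_HOURS = 4.0
MEDIUM_MAX_HOURS = 12.0


def bracket(hours):
    """Map an average estimate in hours to a Small/Medium/Large bracket."""
    if hours <= SMALL_MAX_HOURS:
        return "Small"
    if hours <= MEDIUM_MAX_HOURS:
        return "Medium"
    return "Large"


def group_by_bracket(team):
    """Return {bracket name: [point sizes that fall into it]}."""
    groups = {}
    for points, hours in sorted(team.items()):
        groups.setdefault(bracket(hours), []).append(points)
    return groups


groups = group_by_bracket(TEAM_1)
for name, sizes in groups.items():
    print(f"{name}: {sizes}")
```

Under these assumed cut-offs, 3 and 5 point stories land in Small, 8, 10 and 13 in Medium, and 20 and 30 in Large.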

What about Team 2?

Doing the same analysis of Team 2 we see that a 5 point story is 8 hours of estimated work. That means a 20 point story should be around 32 hours. In fact a 20 point story is 45 hours. It’s a difference, but not overly large.

However, the 5 and 8 point stories are fairly similar in size, so maybe the 8 is more akin to a “medium” story. 8 points is about 10 hours, give or take, so a 20 point story should be around 25 hours. Now the expected size falls almost 50% short of the 45 hours actually estimated. This is very similar to the behaviour we saw in Team 1.
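
The size of the gap for Team 2 can be checked against the unrounded averages. This is a sketch under the same linear-scaling assumption; shortfall is a hypothetical helper that reports how far the scaled-up expectation falls below the hours the team actually estimated.

```python
# Team 2 average estimated hours per story size, copied from the table above.
TEAM_2 = {3: 2.0, 5: 8.0, 8: 9.5, 13: 23.5, 20: 44.67}


def shortfall(team, baseline, target):
    """Fraction by which the linear expectation undershoots the actual hours.

    The expectation scales the baseline size's hours up to the target size.
    """
    expected = team[baseline] / baseline * target
    actual = team[target]
    return (actual - expected) / actual


for baseline in (5, 8):
    gap = shortfall(TEAM_2, baseline, 20)
    print(f"Baseline {baseline} pts: expectation is {gap:.0%} short of the actual 20 point estimate")
```

Using the exact table values, the 5 point baseline undershoots by a bit under 30% and the 8 point baseline by a bit under 50%.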

Again, looking at the sizes it would appear that we have 4 obvious sizing ranges: Small (2 hours), Medium (8 hours), Large (3 days) and Very Large (5-6 days).

Why didn’t Team 1 have a similar Very Large story size in their estimates? Likely because they recognised the Very Large story and broke it down into smaller items.

Why measure standard deviation?

You will have noticed that the stats have a standard deviation column. This is so we can see the volatility of the estimated effort for the various story sizes. For example, the 13 point stories for Team 2 are all over the place. A standard deviation of almost 17 hours is very large. That’s a 2 day variation in effort and likely indicates that the team is still learning what their story points feel like in terms of effort.
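
A raw standard deviation is hard to compare across story sizes, because larger stories naturally vary by more hours. One option, which is my own addition rather than something the teams calculated, is to divide the standard deviation by the average (the coefficient of variation). The sketch below does that for both tables; the helper names are hypothetical, and sizes with a deviation of 0 are skipped on the assumption that they reflect too few stories to say anything.

```python
# (average hours, standard deviation) per story size, copied from the tables.
TEAM_1 = {3: (2.5, 1.32), 5: (2.67, 1.89), 8: (7.67, 3.25), 10: (9.75, 1.77),
          13: (9.67, 7.23), 20: (19.86, 9.33), 30: (22.0, 0.0)}
TEAM_2 = {3: (2.0, 0.71), 5: (8.0, 0.0), 8: (9.5, 4.69),
          13: (23.5, 16.99), 20: (44.67, 9.29)}


def relative_spread(team):
    """Return {points: standard deviation / average}, skipping zero deviations."""
    return {points: sd / mean for points, (mean, sd) in team.items() if sd > 0}


def most_volatile(team):
    """Return the point size with the largest spread relative to its average."""
    spread = relative_spread(team)
    return max(spread, key=spread.get)


for name, team in (("Team 1", TEAM_1), ("Team 2", TEAM_2)):
    print(name)
    for points, cv in relative_spread(team).items():
        print(f"  {points:>2} pts  spread is {cv:.0%} of the average")
    print(f"  most volatile size: {most_volatile(team)} points")
```

On this measure the 13 point stories are the most volatile size for both teams, not just Team 2.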

DON’T ABUSE THE NUMBERS – They’re just indicators

Now that we’ve looked at these numbers, what do we do with them?

We want to use them to Inspect and Adapt; to learn how to be better than we are today, but we must remember that the numbers are just indicators. We may even be looking at numbers that are misleading. If we pay too much attention to the numbers people will start to change behaviour to make them look better. We don’t want the teams to start gaming the numbers since that would reduce visibility and transparency.

While the statistics would seem to indicate that the teams do not completely understand their requirements (the high standard deviations), or that they are padding estimates and still learning what relative sizing is all about, we cannot rely on the statistics alone.

We should take these numbers to the teams for their next retrospective and talk them through. Let’s see what the team can make of them and what steps they suggest for getting better at estimating.

Since these teams want to improve, information like this can help.

Given the estimate size bandings, one suggestion for the teams is to move away from story points for a time and adopt T-Shirt sizing instead. Since they already have this with their Small/Medium/Large time breakdowns, it may help them with their estimating in the short term, and we can revisit the points approach in later sprints once they have a better understanding of themselves and what they are estimating.

The final thought: It’s OK to look at your statistics. Learn from them, but don’t be ruled by them.