Analyzing Software Engineering Data

August 27, 2019 (updated February 6, 2025) · Dr. Michaela Greiler

We are surrounded by data. Often, when we think about data-driven decisions, we think about analyzing data for sales or marketing purposes, or big data in the healthcare field. But, have you ever thought about leveraging your engineering data?

Well, at Microsoft and at many other leading software engineering companies such as Google or Facebook, some departments do just that: they leverage engineering data to improve how software is developed.

Analyzing engineering data is for you

This superpower should not be reserved for these large and leading software companies. So I am starting this new series on analyzing engineering data to share the lessons I learned from analyzing and improving software engineering processes and tools at Microsoft.

By following this series, you will learn how you can use the data traces engineers leave behind to understand and improve your own engineering processes and tools.

This first post gives you an overview of the types of data that are at your fingertips and highlights some of the use cases. In the next post, I’ll show you how you can build your own data analytics platform.

In the subsequent posts, I will deep-dive into how to collect, clean and analyze this data. I will discuss how you can use this data to increase your team’s productivity, speed up testing, or understand bottlenecks and friction in your software engineering process. Make sure to subscribe to my mailing list so you don’t miss any of the posts.

Every day we leave data traces behind

Well, back to your engineering data. As engineers, we create a lot of data just by using our daily tools. We leave traces by writing design or requirements documents, communicating over email or chat, and by developing and submitting code. We also leave traces when we merge code changes, run tests, or build our software.

A lot of this data is unstructured. This means it does not follow any previously defined schema. Without knowing anything about the structure of the data, it is hard to analyze and transform this data.

Some of the engineering data, on the other hand, like markup or metadata, is semi-structured. This means that some of the data in those documents follows a known structure.

And only a few data sources deliver structured data. This is data we can more easily analyze automatically, but it is often also semantically less rich. Whether the data is structured or not, the different data sources lend themselves to different types of analysis. Unstructured data can hold real treasures of insight; it is just harder to analyze automatically.
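To make the difference concrete, here is a minimal sketch of pulling a structured field out of free-text commit messages. It assumes a hypothetical team convention in which commit messages mention tickets as an uppercase project key, a hyphen and a number (for example PAY-481). The pattern, the function name and the sample messages are all made up for illustration.

```python
import re

# Assumption: tickets are referenced as "<PROJECTKEY>-<number>", e.g. "PAY-481".
TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def extract_ticket_ids(commit_message: str) -> list[str]:
    """Return the ticket IDs found in a commit message, in order, without duplicates."""
    found = []
    for ticket_id in TICKET_PATTERN.findall(commit_message):
        if ticket_id not in found:
            found.append(ticket_id)
    return found


# Made-up commit messages, not real data.
messages = [
    "Fix null check in parser (PAY-481)",
    "refactor build scripts",
    "PAY-481, PAY-502: retry failed uploads",
]
for message in messages:
    print(extract_ticket_ids(message))
```

The second message yields an empty list, which is typical: only part of the free text carries structure. A pattern this simple also produces false positives (it would match "UTF-8", for instance), so a real pipeline would need to check the matches against the known project keys.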

Types of engineering data

So, which types of engineering data do you find at your company? Well, I already mentioned a few, but here are the most common data sources you find in a typical software company:

Issue/Bug Repository
Source Code Repositories
Code Review Data
Build Data
HR Database
Chat and Email Traces
Calendars
Company Wiki or Filesystem

So, what can you do with this engineering data?

If the first thing that comes to your mind is a dashboard showing test coverage or churn information, you are far off and at the same time very close. Let me outline different usage scenarios below.

Dashboards, visualizations, and reports

Dashboards highlighting engineering data often show:

the number or percentage of open vs. closed tickets,
the average time to close a ticket,
open code reviews,
code review turn-around times,
code coverage,
some metrics indicating the quality of your code,
or code velocity metrics.

I bet quite a few of those metrics and visualizations (and many more) are already provided out of the box by your tools.
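If you want to see what sits behind two of these numbers, here is a minimal sketch that computes the percentage of closed tickets and the average time to close. The ticket tuples are made-up sample data, and the function names are hypothetical; a real issue tracker export will look different.

```python
from datetime import datetime
from typing import Optional

# Made-up sample tickets as (opened, closed); closed is None while a ticket is still open.
tickets = [
    (datetime(2019, 8, 1), datetime(2019, 8, 3)),
    (datetime(2019, 8, 2), None),
    (datetime(2019, 8, 5), datetime(2019, 8, 9)),
    (datetime(2019, 8, 6), None),
]


def percent_closed(tickets) -> float:
    """Percentage of tickets that have a closing timestamp."""
    if not tickets:
        return 0.0
    closed = sum(1 for _, closed_at in tickets if closed_at is not None)
    return 100.0 * closed / len(tickets)


def average_days_to_close(tickets) -> Optional[float]:
    """Mean number of days from opened to closed, ignoring tickets that are still open."""
    durations = [
        (closed_at - opened_at).total_seconds() / 86400
        for opened_at, closed_at in tickets
        if closed_at is not None
    ]
    if not durations:
        return None
    return sum(durations) / len(durations)


print(percent_closed(tickets))         # 50.0
print(average_days_to_close(tickets))  # 3.0
```

Note that the average silently ignores open tickets, so a backlog of old, unresolved tickets does not show up in it. That is one reason to look at how a dashboard number is defined before trusting it.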

The first question to ask yourself is: are you using those dashboards? Which ones are really relevant to your team? In which context do you use them? Of similar importance is the question of why you aren’t using them. You might have very good reasons.

Accelerate your team with customized dashboards

Do the out-of-the-box metrics and dashboards reflect what you want to know? Which visualizations and dashboards would make your life and your team’s lives even better?

Maybe the definition of code review turn-around time reported by your code review tool isn’t that useful for you because the tool assumes you hit the “close review” button after completion. But in real life, your team just merges the pull request and never hits the button. In that case, you can provide your team with your own review turn-around metric that is customized to your team’s practices.
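A customized metric for this situation could look like the following minimal sketch. It assumes each review record carries three timestamps (opened, closed, merged), of which the last two may be missing. The function name and the fields are hypothetical and do not come from any particular code review tool.

```python
from datetime import datetime
from typing import Optional


def review_turnaround_hours(
    opened_at: datetime,
    closed_at: Optional[datetime],
    merged_at: Optional[datetime],
) -> Optional[float]:
    """Hours from opening a review until it was closed or merged, whichever came first.

    Returns None when the review has been neither closed nor merged yet.
    """
    end_candidates = [t for t in (closed_at, merged_at) if t is not None]
    if not end_candidates:
        return None
    finished_at = min(end_candidates)
    return (finished_at - opened_at).total_seconds() / 3600


# A review that was merged but never explicitly closed (made-up timestamps).
hours = review_turnaround_hours(
    opened_at=datetime(2019, 8, 1, 9, 0),
    closed_at=None,
    merged_at=datetime(2019, 8, 2, 15, 0),
)
print(hours)  # 30.0
```

Treating the merge as the end of the review is a design choice that fits the team described above. A team that keeps discussing after the merge would need a different definition.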

Combining data sources is the real deal

Even though you can make customized reports that rely on data from one tool, such as the code review tool, there are more interesting scenarios for leveraging your engineering data.

You reach a new dimension of insights once you start thinking about how you can combine data from different sources.

I will lead with a tricky example: code velocity. Code velocity is the speed at which you implement new features or fix bugs. I say it is a tricky example because code velocity metrics can be very misleading. In addition, measuring code velocity will often lead to unwanted side effects, such as teams gaming the metrics.

Nevertheless, I chose this example, because I’m sure many (want to) measure it, and because it is an excellent example to show the drawbacks and pitfalls of such metrics.

Measuring code velocity

A straightforward way of measuring code velocity is by measuring how long it takes for a ticket to go from an opened state to a closed state (and many variations thereof). As you can imagine, this does not reflect any of the real engineering workflows and most of the time is not an indication of code velocity whatsoever.

So, when you have access to engineering data, you could instead measure how long it takes from moving a ticket to the active stage until the code solving this ticket is deployed to production. Even better, you can break down this measurement into smaller increments and look for bottlenecks in the workflow.

For example, looking closer you might see that code stays particularly long in the code review phase. Or maybe the code lingers for several days before it is deployed. I’m not going to go into any details, because this is just meant to spark your imagination. Any meaningful metrics and follow-up investigations have to be tied to your team, your practices, and your tooling.
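Purely as an illustration, here is a minimal sketch of such a breakdown. It assumes you have already combined the ticket tracker, the code review tool and the deployment logs into one record per ticket with four timestamps. The stage names, the event names and the sample records are all hypothetical.

```python
from datetime import datetime

# Assumption: each stage is bounded by two events recorded for every ticket.
STAGES = [
    ("development", "activated", "review_opened"),
    ("code review", "review_opened", "merged"),
    ("waiting for deployment", "merged", "deployed"),
]


def stage_durations_hours(events: dict) -> dict:
    """Hours a single ticket spent in each stage."""
    return {
        name: (events[end] - events[start]).total_seconds() / 3600
        for name, start, end in STAGES
    }


def average_stage_hours(tickets: list) -> dict:
    """Average hours per stage across all tickets."""
    if not tickets:
        raise ValueError("need at least one ticket")
    totals = {name: 0.0 for name, _, _ in STAGES}
    for events in tickets:
        for name, hours in stage_durations_hours(events).items():
            totals[name] += hours
    return {name: total / len(tickets) for name, total in totals.items()}


# Made-up sample records, one per ticket.
tickets = [
    {
        "activated": datetime(2019, 8, 1, 9),
        "review_opened": datetime(2019, 8, 1, 17),
        "merged": datetime(2019, 8, 3, 17),
        "deployed": datetime(2019, 8, 4, 9),
    },
    {
        "activated": datetime(2019, 8, 5, 9),
        "review_opened": datetime(2019, 8, 6, 9),
        "merged": datetime(2019, 8, 9, 9),
        "deployed": datetime(2019, 8, 9, 17),
    },
]

averages = average_stage_hours(tickets)
bottleneck = max(averages, key=averages.get)
print(averages)
print(bottleneck)  # code review
```

With this sample data, code review is the slowest stage on average. Averages hide outliers, so in practice you would also want to look at the distribution per stage before drawing conclusions.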

Metrics and dashboards can be helpful or harmful depending on how you use them.

Unwanted side-effects of analyzing your engineering data

Well, that sounds amazing, you might say. Or maybe you are more experienced, and you have already started to shake your head and pray that no more teams adopt such metrics to (in the worst case) increase the productivity of their engineering team. I hope I haven’t lost many readers before I can tell them: you have to be super careful when using such metrics!

Let’s stay with the code velocity metric. Let’s say you implement this metric and you hope you will increase your team’s efficiency by measuring it. Well, I can guarantee you that your team members, the moment they hear about your measurement (even just rumors), are skilled and intelligent enough to game your metric.

They might change the way they estimate the workload to perform well according to your metrics. They might produce lower-quality code to speed up development. Or they might work crazy hours to meet your expectations. None of these are good outcomes. And even though your metric scores might improve, you don’t achieve the impact you designed the metric for.

I’ll deep-dive into this crucial topic in another blog post, but for now, I’ll just leave the Wells Fargo scandal here as a cautionary tale.

But if you truly understand the strengths and weaknesses of the metrics, and most importantly how you can use them, they can have tremendous benefits for engineering teams and companies.

One-off investigations

Well, using regular reports or dashboards is just the beginning. The real power of your engineering data comes from the deep investigations you can perform. Applying statistical regressions, machine learning techniques, or data mining to your engineering data can open up a whole universe of insights.

For example, you could find out whether increasing your test coverage has a real impact on the number of post-release failures. Or you could investigate whether code that has a strong code owner has higher quality. Or, as we did in another study at Microsoft, you could look at which code review comments really provide value to your engineers.

Each of the outcomes of these studies can help you improve your software engineering processes, as well as tools.
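As a first, very rough step toward the test coverage question, you could check whether coverage and post-release failures move together at all. The following minimal sketch computes a Pearson correlation coefficient, which is a simpler technique than the regressions mentioned above. The per-component numbers are made up for illustration and are not results from any study.

```python
from math import sqrt


def pearson_correlation(xs, ys) -> float:
    """Pearson correlation coefficient between two equally long sequences of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return covariance / sqrt(variance_x * variance_y)


# Made-up data: one entry per component.
test_coverage_percent = [40, 55, 60, 75, 90]
post_release_failures = [12, 9, 9, 5, 2]

r = pearson_correlation(test_coverage_percent, post_release_failures)
print(round(r, 2))  # negative here: more coverage goes with fewer failures
```

A negative coefficient only says that the two numbers move in opposite directions in this sample. It does not show that coverage causes fewer failures; component size, age or ownership could explain both, which is why a real investigation needs more than one variable.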

Your engineering world runs on data

What? You want to start analyzing your data right now!? I’m thrilled.

But hold your horses. You probably can’t just jump right in and replicate the analyses and studies I just mentioned. Why not? Because you will be missing one crucial ingredient: the data. But didn’t I just tell you that you have all the data sources already at your fingertips?

True, but most likely, you can’t just start analyzing this data. You will need to extract, transform, clean and store this data somewhere first. Where? In your very own data analytics platform. This data platform will give you access to all those data sources in a meaningful way. In the next post, I’ll show you exactly that: how to build your own engineering data analytics platform.
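To give a feeling for what "extract, transform, clean and store" means at the smallest possible scale, here is a minimal sketch. It assumes the extraction step already produced a list of raw ticket rows. The field names, the cleaning rules and the table layout are hypothetical, and an in-memory SQLite database stands in for a real data platform.

```python
import sqlite3
from datetime import datetime

# Made-up raw rows as they might come out of an export.
raw_rows = [
    {"id": "PAY-481", "state": " Closed ", "opened": "2019-08-01T09:00:00"},
    {"id": "PAY-481", "state": " Closed ", "opened": "2019-08-01T09:00:00"},  # duplicate
    {"id": "PAY-502", "state": "open", "opened": "2019-08-02T10:30:00"},
    {"id": "", "state": "open", "opened": "2019-08-03T08:00:00"},  # no ID
]


def clean(rows):
    """Drop rows without an ID, remove duplicates, and normalize state and timestamp."""
    cleaned = {}
    for row in rows:
        ticket_id = row["id"].strip()
        if not ticket_id:
            continue
        opened = datetime.fromisoformat(row["opened"])  # fails loudly on bad timestamps
        state = row["state"].strip().lower()
        cleaned[ticket_id] = (ticket_id, state, opened.isoformat())
    return list(cleaned.values())


def store(rows, connection):
    """Write cleaned rows into a tickets table, replacing rows with the same ID."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS tickets (id TEXT PRIMARY KEY, state TEXT, opened TEXT)"
    )
    connection.executemany("INSERT OR REPLACE INTO tickets VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
store(clean(raw_rows), connection)
stored = connection.execute("SELECT id, state FROM tickets ORDER BY id").fetchall()
print(stored)  # [('PAY-481', 'closed'), ('PAY-502', 'open')]
```

Even in this toy version, most of the code deals with cleaning rather than analysis, which matches the point above: the data has to be prepared before it can answer questions.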

I’m here to help

I have several years of experience with collecting, cleaning and analyzing engineering data at Microsoft. I also led several projects that improved tools and processes based on these insights. So, if you need help with leveraging your engineering data, make sure to follow along with my series by subscribing to my mailing list. Also, feel free to reach out to me via Twitter or book a free consultation. You can also join the engineering data analytics community.

Updated on February 6th, 2025