Know your assumptions, and test them
In my day-to-day work I very often stumble upon yet another issue in one of our software systems that needs fixing. Most often, the issue is rather apparent and easy to fix.
These past couple of days (ok, almost a week) I have been stuck on a problem that I only really solved after talking to a colleague about a problem (or actually two problems) they had.
My issue had to do with something in one of three threads not doing what it was supposed to in very rare cases (about 5% of all runs), but attaching a debugger or inserting debugging print statements made the problem go away. At least it seemed that way.
My colleague was stuck on one especially tricky issue - which got me asking him: “Ok, you have been poking at this for a day - what is it that you assume to be true?” And we got talking - and me being the old guy, having spent days and weeks tracking down weird issues, only to find that what I thought I knew to be true was the problem all along - I decided to try to be wise…
Ok, I told him (and another colleague in attendance) - you need to know what the symptoms of the problem are - what is it that we know to be true about the problem - and we listed what the users had experienced.
And next - try to come up with every possible thing - weird as it may be - that could cause this behavior… And then try to disprove as many of them as quickly as possible.
Next, list what you absolutely know to be true about the state of the system - and then try to disprove that.
… what you end up with should - hopefully - be a manageable shortlist of ideas about what the problem could be, and how to solve it…
It took me a night at home relaxing and a good night's sleep (and a good dose of coffee this morning) to have the epiphany: I was actually not doing that!
I decided to put debug statements outside the threads, in places where I was pretty sure they would not cause problems, and began working on a completely different feature - I figured since this problem only appears sometimes I could at least make the test runs test something else as well… And lo and behold - within 30 minutes the error appeared, and the log message told me something interesting: “the ring buffer had 375 written entries of sample data - but no reads yet”.
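For illustration, here is a minimal sketch of one way to get that kind of log line without printing from inside the threads. It is not the code from our system: RingBufferStats, note_write, note_read and describe are made-up names, and I am assuming the threads can afford one relaxed atomic increment per operation. The threads only bump counters; the formatting and logging happen somewhere else.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical counters shared between the worker threads and the logger.
struct RingBufferStats {
    std::atomic<std::uint64_t> writes{0};
    std::atomic<std::uint64_t> reads{0};
};

// Called from inside the threads: a single relaxed increment, no locks, no I/O.
inline void note_write(RingBufferStats& s) {
    s.writes.fetch_add(1, std::memory_order_relaxed);
}
inline void note_read(RingBufferStats& s) {
    s.reads.fetch_add(1, std::memory_order_relaxed);
}

// Called from outside the threads (for example from the code that reports
// the failure): takes a snapshot of the counters and formats it.
inline std::string describe(const RingBufferStats& s) {
    const std::uint64_t w = s.writes.load(std::memory_order_relaxed);
    const std::uint64_t r = s.reads.load(std::memory_order_relaxed);
    std::string msg = "ring buffer: " + std::to_string(w) + " written, " +
                      std::to_string(r) + " read";
    if (w > 0 && r == 0) {
        msg += " (no reads yet)";
    }
    return msg;
}

// Small single-threaded self-check of the formatting.
inline void stats_self_check() {
    RingBufferStats s;
    for (int i = 0; i < 375; ++i) {
        note_write(s);
    }
    assert(describe(s) == "ring buffer: 375 written, 0 read (no reads yet)");
}
```

The two loads in describe are not taken at the same instant, so the snapshot is only approximate while the threads are running. For a "what state were we in when it broke" message that is usually good enough.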
OH! I knew exactly what line of code checked whether we should read samples or not… And that single if-statement read a sample_count variable (how many samples we had written to the output video), and if that was lower than the number of available samples it should read samples from the ring buffer, write them to the output video and increment the sample_count variable.
And since we now knew that we never wrote any data to the stream (since we had read zero times from the ring buffer) and that the ring buffer was full (375 being the maximum number of 128-byte samples it could hold) - the initial value of sample_count would have to be more than zero the first time that if-statement was run… (at least in 5% of the runs we had)
This sounded like the good old “uninitialized integer assumed to be zero”-problem. I cannot overstate how many times I have made this mistake and let it consume multiple days of debugging before getting to the eureka moment of realizing: yep, I was being stupid again.
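To make the pattern concrete, here is a small sketch of the check described above. Only sample_count comes from the real code; OutputWriter, should_read and drain are hypothetical names, and the actual read-from-ring-buffer and write-to-video steps are left out. In C++ a plain integer member without an initializer has an indeterminate value in a local or heap-allocated object, so "it starts at zero" is exactly the kind of assumption that needs testing.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the state behind the if-statement.
struct OutputWriter {
    // Bug pattern: `std::size_t sample_count;` with no initializer.
    // Reading it before assigning is undefined behavior, and in practice it
    // holds whatever happened to be in that memory.
    std::size_t sample_count = 0;  // fix: give it an explicit initial value
};

// The check: read only while fewer samples were written than are available.
inline bool should_read(const OutputWriter& w, std::size_t available_samples) {
    return w.sample_count < available_samples;
}

// Copies samples until the check says stop; returns how many were copied.
inline std::size_t drain(OutputWriter& w, std::size_t available_samples) {
    std::size_t copied = 0;
    while (should_read(w, available_samples)) {
        // Real code would read from the ring buffer and write to the video here.
        ++w.sample_count;
        ++copied;
    }
    return copied;
}

// Self-check: a zero start drains everything, a stale nonzero start reads nothing.
inline void writer_self_check() {
    OutputWriter fresh;
    assert(drain(fresh, 375) == 375);

    OutputWriter stale;
    stale.sample_count = 1000;  // simulates leftover garbage in the variable
    assert(drain(stale, 375) == 0);
}
```

The second case in writer_self_check is the symptom from the log: plenty of samples available, zero reads. Compiler warnings such as -Wuninitialized or a run under a memory checker can catch some of these, but an explicit initializer at the declaration is the cheapest fix.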