The art of debugging – Story of expert problem solving process
I have always loved to solve problems. Even as a kid I wanted to find a way out of the maze or spot the differences between pictures. The feeling of success is intoxicating and it doesn’t vanish when you grow up. The problems just evolve to be a little bit different, and usually harder. Sometimes you really have to be a detective to figure them out.
“The impossible could not have happened, therefore the impossible must be possible in spite of appearances.”Agatha Christie, Murder on the Orient Express, Hercule Poirot
Most of the coding mistakes are small like typing mistakes or syntax errors which can be fixed immediately but then there are these more complex bugs you really have to dig in to find the solution. The thought process behind solving these kinds of problems have always interested me and here, in the following paragraphs, I will describe my thought process during solving one tricky problem.
The task was quite simple: To create a process where you read data from Apache Kafka, an open source event streaming platform, process the data and write it to the postgres/postGIS database using an FME Desktop by Safe Software.
FME is a data integration platform where you drag-and-drop transformers (ready-made data transformation blocks) to build and automate custom workflows without having to code. Of course if you want to code it is possible since FME supports for example Python.
The basic FME workspace was quickly done and even the KafkaConnector worked without any problems. The data seemed just as it should be. Below is a picture of a simplified version of the FME workflow which should work without any problems and with most of you it does work perfectly. I will explain later why it works straight out of the box with some of you (I don’t want to spoil the solution yet).
For me the problem came when the Timestamp was being transformed into ISO datetime. Here is the copied error message (yes, occcurred is written with three Cs… and that is not the problem I am solving here…):
|WARN |DateTimeConverter: An internal error has occcured. Please visit http://www.safe.com/support
|WARN |DateTimeConverter: Failed to evaluate expression ‘@DateTimeParse(1615874887.596,%s$,repair)’. Result is set to null
The solving process
My first thought was that, OK…, the DateTimeConverter cannot use millisecond since it works fine with only seconds, but really I need those milliseconds so yeah I can work around this and just use math and string functions to transform 1615874887.596 to 2021-03-16T06:08:07.596Z. No biggie…
Until you try to write the nice new Timestamp to the postgres/postGIS database and the similar error occurs… Postgres/postGIS database timestampz format supports milliseconds so it shouldn’t be a problem.
I tried every trick and transformers I could think of. Seconds from epoch works fine, but as soon as the milliseconds are added, it fails.
The weirdest thing was that if I enabled Feature Caching and ran the workspace in two parts (first one ending before the DateTimeConverter and then the rest) the Timestamp worked as it should. It just didn’t make any sense. If I run the whole thing at once even with Feature Caching enabled it fails but running it in two parts works? Why?
Then I decided to create the same Timestamp (1615874887.596) from scratch with AttributeCreator just to test the DateTimeConverter since the FME documentation says it should support milliseconds. Below is a picture of the workflow and surprise surprise… it worked as it should be whereas the parallel Timestamp coming from the KafkaConnector gives the result of null. This narrowed the problem into the KafkaConnector.
But why does the KafkaConnector Timestamp fail? Could it be that the data type of the Timestamp is wrong? Further scrutinizing the data revealed that the Timestamp coming from the JSONExtractor is a string of UTF-8. A small joy quickly turned to sorrow when changing it to 64 bit real didn’t work.
What if the json coming from the KafkaConnector isn’t UTF-8 but something else and it wrongly assumes it to be UTF-8?
At this point I decided to contact the local FME support at Spatialworld. They gave me good suggestions. We even set up a meeting where I showed the problem and sadly we couldn’t reproduce the problem with their systems so they could have continued the debugging process. There evidently was a bug but it seemed only to show up on my computer. They decided to send a support request to the main office anyway with a detailed explanation of the problem.
I didn’t want to give up. The bug was coming to my dreams and it haunted me. So I continued my own detective work.
A small clue came when I decided to unify all the Timestamps coming from Kafka to show 3 decimal places (example 1615874887.1 → 1615874887.100). I used stringFormatter (with a format string of .3f) to do this and I couldn’t believe my eyes when I saw the results. The Timestamp from AttributeCreator was formatted to 1615874887.596 as it should be but the Timestamp from KafkaConnector showed 1615874887,596. What is that? A comma? Shouldn’t it be a point!? In this case the appearance really does deceive. No wonder the DateTimeConverter and postgres/postGIS writer couldn’t “evaluate the expression”.
So it had to do with encoding or something similar and the KafkaConnector was the source of all evil.
For a long time I tried to find a way to transform the Timestamp coming from the KafkaConnector to the correct format but without any results. At some point of despair I decided to write my own KafkaConnector with Python using the PythonCaller. At least then I could really be in control of what kind of data and in which form it comes out of the Kafka.
My surprise was immense when I got the Python script done and exactly the same error message screamed at me in red in the Translation Log. The Python script had the same bug as the KafkaConnector! And even the comma plagued the formatted Timestamp. This can’t be real?
What was so profoundly different with my computer that it transformed all the decimal points to commas? Encoding? No, it wasn’t that, I tested it with the PythonCaller… What else affects the decimal separation marks? And then it hit me… Locale, System Locale.
So I printed the current locale with Python:
And the response was:
| 2.5| 0.0|INFORM|(‘en_FI’, ‘cp1252’)
| 2.5| 0.0|INFORM|(‘English_Finland’, ‘1252’)
| 2.5| 0.0|INFORM|cp1252
To test my theory I changed the locale with Python to the British ones:
And it worked. It worked as it should have been all along. Many hours of wondering the root cause of the bug was finally there in my grasp – The Finnish system locales.
The summary of all the clues
- When using only seconds from epoch (1615874887) → works
- When milliseconds are used (1615874887.596) → fails on both DateTimeConverter and writing to Postgres/postGIS
- Postgres/postGIS database supports milliseconds in timestampz format
- If you enable Feature Caching and run the workspace in two parts (the first run ends before the DateTimeConverter and then continue the run normally to the end) → works
- If you run the whole workspace with one go (even with Feature Caching enabled) → fails
- FME desktop DateTimeConverter supports milliseconds
- Timestamp created with AttributeCreator → works
- KafkaConnector seems to be the problem but why?
- Changing the datatype from string UTF-8 to 64 bit real → fails
- StringFormatter (.3f) formats the Timestamp coming from the KafkaConnector to 1615874887,596 and from AttributeCreator to 1615874887.596
- Self made Python version of the KafkaConnector gives the same error message and comma (not point) as a decimal separator
- In the FME Cloud server the workspace worked as intended
Sometimes it is not that fortunate to live in the world’s happiest country. This is not the first time I have fought with the Finnish system locales. I should add it to the list of “check these if you have a weird bug” which already contains encoding (because of Finnish ä, ö, å), wrong apostrophes (“, ‘, `,…) and wrong boolean logic.
As I mentioned at the beginning of this blog this bug does not materialize with everyone. It only shows up when you use a computer with a system locale that uses comma as a decimal separation. Usually the locale goes hand in hand with the country you live in. Here is a list of countries which uses the decimal comma: https://en.wikipedia.org/wiki/Decimal_separator#Countries_using_decimal_comma.
I want to thank Spatialworld FME support team for answering my tough questions and listening to my bug explanations.
How does your debugging thought flow go?