Data Races

I spent over two years learning about software engineering before I learnt about data races. This was partly because I spent most of those two years working on the client side of the web, and partly because I predominantly wrote JavaScript, whose single-threaded model largely shielded me from data races. When I eventually began writing Go, which can run multiple threads, I quickly came across them. Being a member of a small team working on FinTech solutions, especially in a microservices architecture, also meant I inevitably had to deal with this.

A data race occurs when multiple threads access the same piece of data without proper synchronization, with at least one thread writing to it. Suppose, for example, you were building a wallet system that allows money transfers. Before every transfer, you want to check that a user has enough money in their account to allow the transfer. After confirming that they do, you complete the transfer and write the new balance, to show that they have spent from their previous balance. A programme with this logic seems reasonable. However, what happens when a user makes two transfers at around the same time? In a language that has multiple threads or that supports concurrency, your server could end up processing the two transactions simultaneously. Say the initial balance was 50,000 naira, and two transfers were made: one of 40,000 naira and another of 30,000 naira. When your logic runs, it first checks if the balance is enough. If the two transfers run closely enough together, the balance check could pass for both the 40,000 naira transaction and the 30,000 naira transaction. However, once both deductions are made, the balance is not enough, and there is no way for your programme to know this. And so your programme inadvertently allows your customer's balance to fall into the negative.
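To make this concrete, here is a minimal Go sketch of the unsafe check-then-deduct logic (the balance variable and transfer function are illustrative, not taken from a real wallet system). Because the check and the deduction are separate steps with no synchronization, two goroutines can both pass the check before either one deducts; running this with go run -race will also flag the conflicting accesses:

package main

import (
	"fmt"
	"sync"
)

var balance = 50000 // naira

// transfer checks the balance, then deducts from it, with no
// synchronization: two goroutines can both pass the check
// before either one performs the deduction.
func transfer(amount int, wg *sync.WaitGroup) {
	defer wg.Done()
	if balance >= amount { // step 1: check
		balance -= amount // step 2: deduct, a separate, later step
	}
}

func main() {
	var wg sync.WaitGroup
	wg.Add(2)
	go transfer(40000, &wg)
	go transfer(30000, &wg)
	wg.Wait()
	fmt.Println("final balance:", balance) // can print -20000
}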

One way to think about this is that you store five computers for a customer. They can come for the computers at any time, usually by sending an agent. They could even send multiple agents at the same time. When an agent comes, your employees can only either check the number of computers stored or hand over the computers requested, and there is a lag between the two actions. So an employee first confirms that there are enough computers. If there are, they return to inform the agent of how many computers are stored, and then go back to fetch the number of computers the agent wants. During that lag, after one employee has checked the number of computers and determined that there are enough to give from, another agent could arrive. A second employee checks before any computers have been removed, and finds that there are still enough to give from. However, when one employee gives the first agent four computers and the other gives the second agent three computers, it turns out that you have given out two of your own computers. While computer programmes often seem to run very fast, there is usually still a very small sliver of time between each step. Therefore, this is a real possibility.

A single-threaded language is unlikely to face this problem if your application is small enough and runs from only one server. This is because the default state is to run synchronously, with blocking actions that run one after the other. In JavaScript, the next statement will not begin to run until the current one finishes, unless you use its asynchronous features. However, in a language like Go, which can run processes concurrently, a data race is a real possibility. And even in a synchronous language like JavaScript, once you consider horizontal scaling with multiple servers, you have to consider a solution for data races in your application. In fact, even a single Node.js server can suffer race conditions, since Node.js is generally non-blocking across separate requests: an await between the balance check and the balance write leaves a window for another request to slip in.

One solution to data races is a lock. A lock ensures that no other process can read or write the data until the current process has completed its operation. In the computer example above, an analogous lock would be locking the room the computers are stored in when you check whether there are enough, and not opening it until you have handed out the requested computers. If another employee tries to give out computers to a second agent before you complete your operation, the door would be locked and they would need to wait for you to finish. Some languages let you create locks in your applications. One example of such a language is Go. It would look like so:

var m sync.Mutex // from the standard library's sync package

m.Lock()
// Carry out your operation
m.Unlock()

m.Lock()
// Carry out ANOTHER operation
m.Unlock()

The mutex lock in Go ensures that when another operation tries to acquire the lock on m while it is already held, it has to wait for the first operation to unlock it. In a real Go application, it is better to use defer m.Unlock(), so the mutex is released even if the operation panics; otherwise the programme could deadlock. And this is one main danger of locks: if a deadlock happens, all subsequent operations are unable to run. This is one reason why professionals often recommend using queues instead of locks in some instances. Another reason that queues might generally be more favourable is that locks can become more complex to implement at the database level rather than at the application level. And when there are multiple servers, an application-level lock is useless, as it will not prevent processes running on different servers from racing.
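Applied to the wallet example, a transfer guarded by a mutex might look like the following sketch (again, the names are illustrative). Deferring the unlock guarantees the mutex is released even if the code between the check and the write panics:

package main

import (
	"fmt"
	"sync"
)

var (
	m       sync.Mutex
	balance = 50000 // naira
)

// transfer holds the lock across both the check and the deduction,
// so no other goroutine can observe or change the balance in between.
func transfer(amount int) bool {
	m.Lock()
	defer m.Unlock() // released even if the code below panics
	if balance < amount {
		return false // insufficient funds
	}
	balance -= amount
	return true
}

func main() {
	fmt.Println(transfer(40000))           // true
	fmt.Println(transfer(30000))           // false: only 10,000 left
	fmt.Println("final balance:", balance) // 10000
}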

A queue is a data structure that operates on a first-in, first-out principle: the first thing that gets into the queue is the first thing that is removed (processed). Because items in a queue are processed one after the other, there is no data race. A popular message broker for queues is RabbitMQ (and the only one I have used so far). All of your servers can dispatch operations to the queue, and a worker can process the actions on the queue one at a time. If you do not want an application-level lock or you need a system-wide solution, a queue could be the better option.
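As a small in-process illustration (a real multi-server setup would put RabbitMQ or a similar broker between the servers and the worker), a Go channel can stand in for the queue, with a single worker draining it in first-in, first-out order so that only one transfer ever touches the balance at a time:

package main

import "fmt"

var balance = 50000 // naira

func main() {
	requests := make(chan int)
	done := make(chan struct{})

	// A single worker processes requests one after the other, so the
	// check and the deduction can never interleave with another transfer.
	go func() {
		for amount := range requests {
			if balance >= amount {
				balance -= amount
			}
		}
		close(done)
	}()

	requests <- 40000 // first in, first processed
	requests <- 30000 // processed second; rejected, only 10,000 left
	close(requests)
	<-done
	fmt.Println("final balance:", balance) // 10000
}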