Distribution uses TCP as the underlying network protocol. TCP generally provides reliable connectivity between machines on a network, but network errors can still cause a TCP connection to drop. When a TCP connection drops, requests and responses between nodes participating in a distributed transaction are not delivered. Network errors are detected by the keep-alive protocol described in the section called “Detecting failed nodes” and handled by the distributed transaction protocol.
Network connectivity failures are caused by:
A non-response keep-alive timeout occurring.
TCP retry timers expiring.
Lost routes to remote machines.
These errors are usually caused by network cables being disconnected, router crashes, or machine interfaces being disabled.
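To make the first item in the list above concrete, here is a minimal sketch of a non-response timeout check. It is not the product's keep-alive protocol: the class name KeepAliveMonitor, its methods, and the timeout value are all hypothetical and chosen only for illustration.

```python
import time


class KeepAliveMonitor:
    """Hypothetical helper: tracks the last keep-alive response per node."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.clock = clock  # injectable so the check can be tested
        self.last_response = {}

    def record_response(self, node):
        # Call this whenever a keep-alive response arrives from a node.
        self.last_response[node] = self.clock()

    def failed_nodes(self):
        # A node counts as unreachable once no response has been seen
        # for longer than the configured timeout.
        now = self.clock()
        return sorted(
            node
            for node, seen in self.last_response.items()
            if now - seen > self.timeout_seconds
        )
```

A caller would invoke record_response each time a node answers and poll failed_nodes periodically. A monotonic clock is used so that wall-clock adjustments do not trigger false timeouts.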
As discussed in the section called “Local and distributed transactions”, all distributed transactions have a transaction initiator that acts as the transaction coordinator. The transaction initiator can detect network failures when sending a request to, or reading a response from, a remote node. When the transaction initiator detects a network failure, the transaction is rolled back. Other nodes in a distributed transaction can also detect network failures. When this happens, a rollback is returned to the transaction initiator, which then rolls back the transaction. This is shown in Figure 6.4, “Connection failure handling”.
Figure 6.4. Connection failure handling
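The two detection paths can be sketched as follows. This is an illustration under stated assumptions, not the actual distributed transaction protocol: the Node class, the "ok" and "rollback" response strings, and the decide_outcome function are hypothetical names introduced here.

```python
class Node:
    """Hypothetical stand-in for a remote node in a distributed transaction."""

    def __init__(self, name, reachable=True, sees_failure=False):
        self.name = name
        self.reachable = reachable
        self.sees_failure = sees_failure

    def request(self, payload):
        if not self.reachable:
            # Path 1: the initiator itself observes the dropped connection.
            raise ConnectionError(f"lost connection to {self.name}")
        # Path 2: a participant that detected a failure returns rollback.
        return "rollback" if self.sees_failure else "ok"


def decide_outcome(participants, payload):
    """Return "rollback" if any failure is detected, otherwise "commit"."""
    for node in participants:
        try:
            response = node.request(payload)
        except ConnectionError:
            return "rollback"
        if response == "rollback":
            return "rollback"
    return "commit"
```

In both paths the decision is made by the initiator. A participant only reports the failure it observed.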
When the transaction initiator performs a rollback because of a connection failure, whether detected by the initiator or by another node in the distributed transaction, the rollback is sent to all known nodes. Known nodes are those that were located using location discovery (see the section called “Location discovery”). This must be done because the initiator does not know which nodes are participating in the distributed transaction. Notice that a rollback is sent to all known nodes in Figure 6.4, “Connection failure handling”. The rollback is retried until network connectivity is restored to all nodes.
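The retry behavior described above can be sketched as a loop over the nodes that have not yet received the rollback. The function rollback_all, the send_rollback callable, and the retry_delay parameter are hypothetical, as is the fixed delay between passes. The actual retry policy is not specified here.

```python
import time


def rollback_all(known_nodes, send_rollback, retry_delay=1.0, sleep=time.sleep):
    """Send a rollback to every known node, retrying until all succeed.

    send_rollback(node) is a hypothetical callable that raises
    ConnectionError while the node is unreachable. Returns the number
    of passes that were needed.
    """
    pending = set(known_nodes)
    passes = 0
    while pending:
        passes += 1
        for node in sorted(pending):
            try:
                send_rollback(node)
            except ConnectionError:
                continue  # still unreachable, so it stays pending
            pending.discard(node)
        if pending:
            sleep(retry_delay)  # wait before retrying the remaining nodes
    return passes
```

As written, the loop does not give up: it returns only after every known node has received the rollback, which mirrors retrying until connectivity is restored to all nodes.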
Transaction rollback is synchronized to ensure that the transaction is safely aborted on all participating nodes, no matter the current node state.