Deadlock in "Rebex.FileTransfer" library Version="5.0.7290"

0 votes
asked Aug 25 by oprakash (120 points)

We are using the "Rebex.FileTransfer" library, Version="5.0.7290", in our product. We have encountered a deadlock in production due to this library.

Could you please confirm the following:

  • Is there a possibility of a deadlock in the Rebex-generated threads running the stacks cited below?

    • Since we did not see this deadlock during 2 years of usage, we would like to know what special conditions may have caused this deadlock now.
    • What is your suggestion for mitigating this issue? Also, please let us know if this issue is fixed in a version later than 5.0.7290.

Please let us know if you need more info.

Observations from the memory dump:
0:006> !dlk
Examining SyncBlocks...
Scanning for ReaderWriterLock(Slim) instances...
Scanning for holders of ReaderWriterLock locks...
Scanning for holders of ReaderWriterLockSlim locks...
Examining CriticalSections...
Scanning for threads waiting on SyncBlocks...
Scanning for threads waiting on ReaderWriterLock locks...
Scanning for threads waiting on ReaderWriterLocksSlim locks...
Scanning for threads waiting on CriticalSections...
CLR thread 0x3c holds the lock on SyncBlock 000001d2d5ec69f0 OBJ:000001d2bc7a67c0[System.Object]
...and is waiting for the lock on SyncBlock 000001d2d5ff2c78 OBJ:000001d2bc7a67f0[System.Object]
CLR thread 0x50 holds the lock on SyncBlock 000001d2d5ff2c78 OBJ:000001d2bc7a67f0[System.Object]
...and is waiting for the lock on SyncBlock 000001d2d5ec69f0 OBJ:000001d2bc7a67c0[System.Object]
CLR Thread 0x3c is waiting at Rebex.Net.SshChannel.bksd(jnrr)(+0x17 IL,+0x42 Native)
CLR Thread 0x50 is waiting at Rebex.Net.SshChannel.bkry()(+0x17 IL,+0x40 Native)

1 deadlock detected.

The threads involved in the deadlock are threads 11 and 13; they are running the following call stacks.

0:013> !syncblk
Index SyncBlock MonitorHeld Recursion Owning Thread Info SyncBlock Owner
361 000001d2d5ec69f0 3 1 000001d2d60adc60 958 11 000001d2bc7a67c0 System.Object
362 000001d2d5ff2c78 3 1 000001d2d629e110 2c54 13 000001d2bc7a67f0 System.Object

364 000001d2d5ec6788 1 1 000001d2d629e110 2c54 13 000001d2bc7a21a0 System.Collections.Generic.List`1[[System.Threading.ManualResetEvent, System.Private.CoreLib]]

Total 427
ComClassFactory 0
Free 377

0:011> !clrstack
OS Thread Id: 0x958 (11)
Child SP IP Call Site
000000f076bfe898 00007ff985f667c4 [GCFrame: 000000f076bfe898]
000000f076bfea38 00007ff985f667c4 [GCFrame: 000000f076bfea38]
000000f076bfea88 00007ff985f667c4 [HelperMethodFrame1OBJ: 000000f076bfea88] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000f076bfebe0 00007ff9078c2b32 Rebex.Net.SshChannel.bksd(jnrr)

000000f076bfec40 00007ff9078c4991 Rebex.Net.SshChannel.bkso(Byte[], Int32, Int32)
000000f076bfecd0 00007ff9078c46cb Rebex.Net.SshChannel.Send(Byte[], Int32, Int32)
000000f076bfed10 00007ff9078c445e gldk.iier(jnrr, glcu)
000000f076bfed90 00007ff9078c429c gldk.iidq(Rebex.Net.SftpServerType, glcu)
000000f076bfee10 00007ff9078c071d Rebex.Net.Sftp.qjjf(glcu)
000000f076bfee70 00007ff9078bc0d3 Rebex.Net.Sftp.qjjd(System.String, System.String, Rebex.Net.SshPrivateKey, glcu)
000000f076bfeef0 00007ff9078bbcb4 Rebex.Net.Sftp.Login(System.String, System.String)
000000f076bfef30 00007ff9078bbb63 Rebex.Net.FileTransferClient.Login(System.String, System.String)
000000f076bfef70 00007ff9076fbd64 Airwatch.PushRelayServer.RebexImpl.PushRelayServerConnectionRebexImpl.Connect(Int32)
000000f076bfefd0 00007ff9076fb871 AirWatch.PullService.Business.PushRelayServerBusiness+c
000000f076bff020 00007ff949b2d409 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [E:\A_work\694\s\src\mscorlib\shared\System\Threading\ExecutionContext.cs @ 167]
000000f076bff0a0 00007ff949be13e7 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef) [E:\A_work\694\s\src\mscorlib\src\System\Threading\Tasks\Task.cs @ 2440]
000000f076bff140 00007ff949c1a9ec System.Threading.ThreadPoolWorkQueue.Dispatch() [E:\A_work\694\s\src\mscorlib\src\System\Threading\ThreadPool.cs @ 588]
000000f076bff5c0 00007ff966a93053 [DebuggerU2MCatchHandlerFrame: 000000f076bff5c0]
0:011> ~13s
00007ff9`85f667c4 c3 ret
0:013> !clrstack
OS Thread Id: 0x2c54 (13)
Child SP IP Call Site
000000f08d9fe428 00007ff985f667c4 [GCFrame: 000000f08d9fe428]
000000f08d9fe5c8 00007ff985f667c4 [GCFrame: 000000f08d9fe5c8]
000000f08d9fe618 00007ff985f667c4 [HelperMethodFrame1OBJ: 000000f08d9fe618] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
000000f08d9fe770 00007ff9078fe1d0 Rebex.Net.SshChannel.bkry()

000000f08d9fe7d0 00007ff9078c1c87 Rebex.Net.SshChannel.bkrt(Byte[], Int32, Int32)
000000f08d9fe860 00007ff90772ac30 Rebex.Net.SshSession.exvy(Byte[], Int32, Int32)
000000f08d9fe920 00007ff907725f0f Rebex.Net.SshSession.exvx()
000000f08d9fea10 00007ff907726914 Rebex.Net.SshSession.xaxv.exwv()
000000f08d9fea80 00007ff90772658b xaxw+ndag.bnom()
000000f08d9feab0 00007ff907726335 xaxw+ndag.bnon()
000000f08d9feaf0 00007ff96f627080 System.Threading.Thread.ThreadMain_ThreadStart() [E:\A_work\180\s\corefx\src\System.Threading.Thread\src\System\Threading\Thread.cs @ 93]
000000f08d9feb20 00007ff949b2d409 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [E:\A_work\694\s\src\mscorlib\shared\System\Threading\ExecutionContext.cs @ 167]
000000f08d9fed78 00007ff966a93053 [GCFrame: 000000f08d9fed78]
000000f08d9ff000 00007ff966a93053 [DebuggerU2MCatchHandlerFrame: 000000f08d9ff000]

1 Answer

0 votes
answered Aug 25 by Lukas Pokorny (124,570 points)

We are aware of two issues in v5.0.7290 that could cause a deadlock. In most scenarios, these deadlocks were rare, so they went unnoticed for a long time.

In v5.0.7320, we fixed an issue that could occasionally cause a deadlock during SSH negotiation. In v5.0.7357, we fixed another deadlock (one that could occur when processing an incoming EOF packet while waiting for the remote receive buffer size to increase).

Both were located in the Rebex.Net.SshChannel class internals, and both occurred only rarely, when specific race conditions were triggered by the timing of the threads involved.

Would it be possible for you to try the current version?

commented Aug 25 by oprakash (120 points)
edited Aug 25 by oprakash
The deadlock during SSH negotiation that you cited was also reported by us.

By the looks of it, this deadlock is at a different location. This needs to be thoroughly investigated. Could you please have your developers analyze this and provide a mitigation for this issue?

Trying/upgrading is not an option for us because this deadlock led to a high-visibility customer escalation. We need to be sure that the deadlock is fixed and will not occur again.
commented Aug 25 by oprakash (120 points)
edited Aug 25 by oprakash
In a matter of 2 weeks we have seen this issue happen three times, which makes us believe that this issue is not so rare. We would also like a comment from your team on what special conditions could make it so frequent.

Upgrading the library is not an option; it would invalidate our otherwise stable code, which has been running fine for 2 years. Are there any other workarounds to avoid the deadlock?

A hotfix would be ideal.
commented Aug 25 by Lukas Pokorny (124,570 points)
Based on further investigation, we can now confirm that you are right - the deadlock you reported today is indeed at a different location than the previous one (which we fixed in v5.0.7320). Fortunately, the original name of SshChannel.bkry() is SshChannel.SendEofInternal(), and this is the location of the second deadlock, the one we fixed in v5.0.7357.

The rarity of the deadlock's occurrence is relative. I'm sure there are scenarios where it is very common and scenarios where it's very rare. We actually discovered this issue ourselves because it caused some of our automated tests to fail occasionally after migrating to new hardware.

It only manifests when very specific conditions are met, and not consistently, which is common with these kinds of issues - the exact timing of code execution on different threads is not at all deterministic, and the same applies to network communication and SSH's TCP packets. All it takes for a hidden deadlock to occur is for a TCP socket send to take a little bit more or a little bit less time than usual.

Anyway, this comes from the issue's description: when the thread processing SshChannel.OnPacketReceive(...) calls SendEofInternal() upon receiving SSH_MSG_CHANNEL_EOF, it has already acquired the _sendSyncPacket lock, and it attempts to acquire the _syncSend lock as well within SendEofInternal(). However, if at that point another thread is waiting inside SshChannel.SendInternal(...) for the remote side to increase its receive window size, a deadlock occurs, because SendInternal holds the _syncSend lock while waiting. OnPacketReceive waits for SendInternal to release the _syncSend lock, while SendInternal waits for OnPacketReceive to process the remote side's packet announcing the receive window size adjustment.

So upgrading to v5.0.7357 (or later) would indeed resolve this issue. There is currently no other mitigation for it. We can prepare a custom hotfix based on v5.0.7290 if you prefer not to upgrade to v5.0.7357 (or later) yet.
commented Aug 25 by oprakash (120 points)
Thank you for this detailed explanation.

At this moment we would like to get a hotfix based on v5.0.7290 to move forward.

We understand the suggestion is to upgrade to v5.0.7357 (or later); we will have to plan for that and take it up later.
commented Aug 25 by mholla (100 points)
Adding to what oprakash mentioned above: to reduce the possibility of the deadlock, would it help if we reduced the number of threads we create for outbound FTP connections?

We are trying to find a workaround for immediate resolution until we receive the hotfix.
commented Aug 26 by Lukas Pokorny (124,570 points)
I just sent a link to the hotfix to your email addresses.

Reducing the number of threads might decrease the likelihood of a deadlock. When the second deadlock used to occur in our automated tests, it was usually triggered by high load (multiple different test suites running at the same time).
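One caller-side way to reduce load, sketched in Python (a hypothetical illustration, not Rebex code; `connect_and_transfer` stands in for your per-connection work): cap how many transfers run concurrently with a semaphore. Note this only lowers the likelihood of hitting the race; it does not eliminate the deadlock.

```python
import threading

MAX_CONCURRENT = 2  # illustrative cap; tune for your workload
gate = threading.Semaphore(MAX_CONCURRENT)
peak = 0
active = 0
counter_lock = threading.Lock()

def connect_and_transfer(job_id):
    """Hypothetical per-connection work, gated by the semaphore."""
    global peak, active
    with gate:  # at most MAX_CONCURRENT threads proceed at once
        with counter_lock:
            active += 1
            peak = max(peak, active)
        # ... Login/transfer would happen here ...
        with counter_lock:
            active -= 1

threads = [threading.Thread(target=connect_and_transfer, args=(i,))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak <= MAX_CONCURRENT)  # → True: concurrency never exceeded the cap
```

In .NET, the equivalent would be a `SemaphoreSlim` (or a bounded work queue) wrapped around the connection/transfer calls.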
commented Aug 26 by oprakash (120 points)
Thanks for a quick turnaround, as always :-). The support experience from Rebex has always been the best.

Unfortunately, we cannot use a DLL reference in our CI process; this has to be a NuGet package. Could you please provide us with a NuGet package instead of a DLL? We should have mentioned it before to save each other's time.

Thanks again.
commented Aug 26 by Lukas Pokorny (124,570 points)
OK, check out the latest email!