SFTP server becomes unresponsive during authentication

0 votes
asked Aug 6 by rondel (240 points)
retagged Aug 31 by rondel

We periodically run into a strange issue during the authentication process that leaves users unable to fully authenticate with the SFTP server. When this happens we have to restart the application altogether, as it does not appear to recover on its own.

I updated to the most recent version of the Rebex.FileServer library, but we're still able to reproduce the issue in our production environment (.NET Core 3.1).

<PackageReference Include="Rebex.Elliptic.Ed25519" Version="1.2.1" />
<PackageReference Include="Rebex.FileServer" Version="5.0.7840" />

Our code looks something like this during authentication:

public bool OnServerAuthentication(object sender, IAuthenticationEvent authEvent)
{
    try
    {
        return tryAuthenticateUser(authEvent);
    }
    catch (Exception e)
    {
        _logger.LogError(e, $"Error authenticating user {authEvent.UserName}");
        authEvent.Reject();
        return false;
    }
}

private bool tryAuthenticateUser(IAuthenticationEvent authEvent)
{
    var user = _userAuthenticationStrategy.AuthenticateUser(authEvent);

    if (user.IsAuthenticated)
    {
        _logger.LogInformation("User {email} is authenticated. Setting up file system...", user.Name); // auth succeeds to here
        _fileSystemResourceSecurity.SetUpFileSystemForUser(user); // deadlock happens here
        _logger.LogInformation("File system set up. Telling user {email} that we are ready!", user.Name); // never reaches this step
        authEvent.Accept(user);
        return true;
    }

    authEvent.Reject();
    return false;
}

The locking/deadlock appears to happen in this block of code:

public void SetUpFileSystemForUser(AuthenticatedFileServerUser authenticatedFileServerUser)
{
    var companyRootFolder = _rootDirectoryProvider.GetOrCreateCompanyRootDirectory(authenticatedFileServerUser);

    var restrictedFileSystem = _fileSystemProvider.CreateChildFileSystem(companyRootFolder);

    authenticatedFileServerUser.SetFileSystem(restrictedFileSystem);
}

The behavior we see is that even though the auth portion succeeds, it never completes FileServerUser.SetFileSystem, so the connection remains in the unauthenticated state. Clients continue to connect and "successfully" authenticate, but the file system setup remains hung and the authentication event is never accepted. Unauthenticated connections continue to build up during this time until we restart the process.

We've been battling this issue on and off for months, so I figured it was worth asking whether you all have any ideas about the source of the problem. My best guess is that there is some kind of race condition, but this is starting to happen multiple times a week and it's getting a bit concerning.

The vast majority of our service registrations are singletons, including the FileSystemProvider, though I'm not sure that should be a problem. Is the FileSystemProvider designed to be thread-safe, or should I be creating and disposing of instances for each thread?

Thanks!


UPDATE

Our implementation is based on ReadWriteFileSystemProvider and uses Azure File Storage as the underlying storage. We share a single instance of this across threads (singleton). The actual GetChild method looks like this:

protected override NodeBase GetChild(string name, DirectoryNode parent)
{
    var childPath = _fileStorageDelegate.GetChild(parent.Path, name);
    if (childPath == null)
    {
        return null;
    }
    if (childPath.Type == FileSystemItemType.Directory)
    {
        return new DirectoryNode(name, parent);
    }

    return new FileNode(name, parent);
}

I am relatively certain that the underlying storage implementation is thread-safe based on the documentation Microsoft provided: https://azure.github.io/azure-sdk/dotnet_introduction.html#dotnet-service-methods-thread-safety. We were initially on an older Azure storage library that I blamed for the apparent race condition, but after getting up to date this still seems to be an intermittent issue.

I haven't been able to reproduce the race condition locally, so there is an element of guesswork involved. However, I will see if I can introduce a timeout around the GetChild call so that we can throw an exception and recover if the storage doesn't respond after some amount of time.
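As a sketch of that timeout idea (a hypothetical wrapper, not tested against the issue; `_fileStorageDelegate` is our own delegate from the code above and the 30-second value is an arbitrary assumption):

```csharp
// Hypothetical guard: run the storage call on the thread pool and give up after
// a deadline instead of letting the authentication pipeline block forever.
private FileSystemItem GetChildWithTimeout(string parentPath, string name)
{
    var task = Task.Run(() => _fileStorageDelegate.GetChild(parentPath, name));
    if (!task.Wait(TimeSpan.FromSeconds(30)))
    {
        // The underlying call is still running; we abandon it and fail fast so
        // the session can be rejected instead of hanging indefinitely.
        throw new TimeoutException($"GetChild('{parentPath}', '{name}') did not complete within 30s.");
    }
    return task.Result;
}
```

Note that the abandoned task keeps running in the background, so this would only stop the hang from propagating to new sessions; it would not fix the root cause.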

Is it recommended to reuse the same FileSystem across threads or should that be created each time?

UPDATE 2

I was able to pinpoint the specific line where the hangs occur:

 var restrictedFileSystem = _fileSystemProvider.CreateChildFileSystem(companyRootFolder);

It appears that it may be the CreateChildFileSystem call taking in a Node that is hanging, not necessarily the SetFileSystem portion. I had added some extra logging to pinpoint the line that was failing.

I am making the experimental change to skip the virtual root path check, but wanted to pass on this extra information in case it helps narrow things down, since I know we were looking at a different suspect.

Applies to: Rebex SFTP, File Server

3 Answers

+1 vote
answered Sep 6 by renestein (4,330 points)
selected Sep 10 by rondel
 
Best answer

1/2
Hi Rondel,
I will answer your questions about singletons and thread safety:

"The vast majority of our service registrations are singletons, including the FileSystemProvider, though I'm not sure that should be a problem. Is the FileSystemProvider designed to be thread-safe, or should I be creating and disposing of instances for each thread?"

From the description of the system design you have provided, you are using instances with the following lifecycles:
1) Singleton instance of the Azure SDK class(es).
2) Singleton instance of the AzureFileSystemProvider (instance of your class) decorated by the "child file system" (user session-scoped instance).
3) Singleton/Session scoped instance of the user.

Dependencies and their lifecycles (simplified):

AzureSdkClass {Singleton} <- AzureFileSystemProvider {Singleton} <- ChildFileSystem {user session-scoped} <- FileServerUser {Singleton or user session-scoped}.

This design exhibits the following liabilities.
1) All requests are served by only one instance of the AzureFileSystemProvider. Every FileSystemProvider returned from the CreateChildFileSystem method is a decorator that only appends the path to the request and then forwards the request to the underlying provider. In our case to the instance of the AzureFileSystemProvider.

2) When this singleton instance of the AzureFileSystemProvider becomes corrupted, or when some virtual file system operation is delayed or gets stuck, all requests for all users might be affected.

So:
Rule 1) DON'T use a single instance of the file system provider for all requests on the SFTP server. Every rule has exceptions, but really don't share instances unless it is necessary because the underlying virtual file system storage has special requirements.

Rule 2) PREFER a user session-scoped file system provider. When one instance of the file system starts to behave badly, only one user session will be affected. All other user sessions will happily live on in their own safe mini-world, unaffected by the fatal incident in the foreign session.

Concretely, for your scenario: handle the authentication events.

//Handle Authentication event to authenticate users.
_sftpServer.Authentication += SftpServerAuthentication;

//Handle Disconnected event to dispose the instance of an AzureFileServerUser.
_sftpServer.Disconnected += SftpServerDisconnected;

SftpServerAuthentication event handler.

private static void SftpServerAuthentication(object sender, AuthenticationEventArgs e)
{
    var currentUserName = e.UserName;

    //Load the user from the database.
    var user = tryAuthenticateUser(currentUserName);

    //When the user is not in our database, deny access.
    if (user == null)
    {
        e.Reject();
        return;
    }

    //Verify the user's password.
    if (!_passwordHashService.VerifyPassword(user, e.Password))
    {
        //When the user sends an invalid password, deny access.
        e.Reject();
        return;
    }

    //Create a "user session scoped" instance of the AzureFileSystemProvider.
    var azureFileSystemProvider = new AzureFileSystemProvider(...);
    //Create an instance of our AzureFileServerUser with the user-specific file system.
    var azureFileServerUser = new AzureFileServerUser(currentUserName, azureFileSystemProvider);

    //Authentication succeeded. Allow user logon.
    e.Accept(azureFileServerUser);
}

SftpServerDisconnected event handler.

 /// <summary>
    /// Handler for the <see cref="Server.Disconnected"/> event.
    /// </summary>
    private static void SftpServerDisconnected(object sender, DisconnectedEventArgs e)
    {
      //Calling the Dispose method on the instance of the AzureFileServerUser
      //initiates disposing of the chain AzureFileServerUser->AzureFileSystemProvider
      //In other words, all "user session scoped" services are disposed.
      if (e.User is AzureFileServerUser azureFileServerUser)
      {
        azureFileServerUser.Dispose();
      }
    }

And AzureFileServerUser might look like this:

/// <summary>
  /// Class represents SFTP user derived from the Rebex <see cref="FileServerUser"/>.
  /// Instances of this class are created after successful authentication.
  /// </summary>
  public class AzureFileServerUser : FileServerUser, IDisposable
  {
    //"user session scoped" AzureFileSystemProvider.
    private readonly AzureFileSystemProvider _provider;

    /// <summary>
    /// Initializes a new instance of the <see cref="AzureFileServerUser"/> class.
    /// </summary>
    /// <param name="userName">User name</param>
    /// <param name="provider">"user session scoped" instance of the AzureFileSystemProvider</param>
    public AzureFileServerUser(string userName, AzureFileSystemProvider provider, string virtualRoot) : base(userName, String.Empty, provider, virtualRoot)
    {
      _provider = provider ?? throw new ArgumentNullException(nameof(provider));
    }

    /// <summary>Performs application-defined tasks associated with freeing, releasing, or resetting unmanaged resources.</summary>
    public void Dispose()
    {
      Dispose(true);
    }


    /// <summary>
    /// Dispose "user session scoped" services.
    /// </summary>
    protected virtual void Dispose(bool disposing)
    {
      if (disposing)
      {
        //Dispose AzureFileSystemProvider.
        _provider.Dispose();
      }
    }
  }

I don't want to sound like a stuck record, but let me reiterate.
A FileSystemProvider with a singleton lifecycle represents a single point of failure. And please be vigilant: in your case, double-check the guarantees in the Azure SDK, because you are still using Azure SDK classes with a singleton lifecycle.

commented Sep 7 by rondel (240 points)
This makes sense to me. I was able to reproduce the deadlock again even with the experimental changes so I will go down the route of creating scoped instances of the file system provider. Thanks for the input
+1 vote
answered Aug 11 by Lukas Pokorny (126,910 points)

When FileServerUser.SetFileSystem(...) method is called, it actually checks for the existence of the virtual root path before exiting. This is done in order to prevent strange errors later on, caused by an authenticated user trying to access a non-existent directory.

However, the check involves an actual call to the provided virtual file system provider, and if that blocks, the SetFileSystem(...) call will block as well. So there indeed is a potential for introducing a deadlock rather easily - all it takes is for the provider's GetChild(...) method to block.
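For illustration, one quick way to confirm whether a custom GetChild(...) override is the call that blocks is to add entry/exit logging with timing around it. This is only a diagnostic sketch for a hypothetical custom provider (the `_logger` and `_fileStorageDelegate` names are assumptions, not part of our API); a hang then shows up in the logs as an "enter" line with no matching "exit" line:

```csharp
protected override NodeBase GetChild(string name, DirectoryNode parent)
{
    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    _logger.LogDebug("GetChild enter: {Parent}/{Name}", parent.Path, name);
    try
    {
        // The call suspected of blocking.
        var childPath = _fileStorageDelegate.GetChild(parent.Path, name);
        if (childPath == null)
        {
            return null;
        }
        return childPath.Type == FileSystemItemType.Directory
            ? new DirectoryNode(name, parent)
            : (NodeBase)new FileNode(name, parent);
    }
    finally
    {
        // Runs even on exceptions; a missing "exit" line means the call never returned.
        _logger.LogDebug("GetChild exit ({Elapsed} ms): {Parent}/{Name}",
            stopwatch.ElapsedMilliseconds, parent.Path, name);
    }
}
```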

But at this point, we can't tell whether the deadlock is due to a bug in our code, a problem in any of your custom providers, or an issue caused by some interaction between _rootDirectoryProvider and _fileSystemProvider. Which provider types do you actually use in your application?

commented Aug 11 by rondel (240 points)
Thanks! I updated my question with some of the details requested. In the logs leading up to the issue, one of the more common patterns we see is the same user running through auth multiple times in quick succession, leading to the file system being "initialized" for the same user repeatedly. In targeted testing this hasn't reproduced the issue for me locally, however.
commented Aug 25 by Lukas Pokorny (126,910 points)
Thanks for the update. We have not been able to reproduce the issue yet, but we'll review ReadWriteFileSystemProvider internals and related code.

In the meantime, we added an experimental "SkipVirtualRootPathCheck" option to the latest release of Rebex File Server (R5.5) that makes it possible to disable the root directory existence check in the SetFileSystem(...) method:

    Rebex.Security.Cryptography.CryptoHelper.SetOption(authenticatedFileServerUser, "SkipVirtualRootPathCheck", true);
    authenticatedFileServerUser.SetFileSystem(restrictedFileSystem);

When enabled, SetFileSystem method will most likely no longer deadlock (but the issue might still occur later when the file system is actually being accessed). Please give this a try and let us know how it behaves - it might provide some clues about the reason.
commented Aug 25 by rondel (240 points)
Awesome! Thanks @lukas, we'll make this change and monitor over the next couple of weeks to see if it gets around the issue. I'll update here if this test is successful. Thanks!
commented Aug 30 by rondel (240 points)
edited Aug 31 by rondel
Still attempting to roll out the experimental change. But I did find some additional information on the line that was failing: it is the `CreateChildFileSystem` call and not the `SetFileSystem` line like we had assumed. (I updated the question.)

There is a licensing issue where I can't quite upgrade the package to the latest version without getting a new support license. I'm investigating that internally, but it will take a bit longer to roll out the test.
0 votes
answered Sep 6 by renestein (4,330 points)
edited Sep 7 by renestein

2/2
Now I can talk about "thread safety" guarantees in our file system providers. The term "thread safety" is overloaded. For our discussion, I will stick with the vague but intuitively meaningful definition "simultaneous calls from different threads are allowed and do not corrupt state".

The answer is complicated.
1) Simple rule. A custom file system provider (for example, your AzureFileSystemProvider) is "thread-safe" (in the sense defined above) if and only if the code in your custom file system provider is thread-safe.
2) It follows from point 1) that our internal code that calls methods on the custom file system provider should be "thread-safe". Because internal operations must be synchronized at "critical" points, and because we know nothing about the code in your custom provider, the synchronization across threads can never be optimal. So there is one more reason to use a user session-scoped FileSystemProvider and avoid a singleton provider: all virtual file system requests between all users must be internally synchronized, and when a "critical point" is reached and one IO request gets stuck, all IO requests might wait. From the outside, this manifests as "the SFTP server does not make any progress and all requests are blocked".

I have written above that your provider should be thread-safe if you would like to support concurrent calls.

You are in charge now, but our library tries to be your non-violent, wise, and helpful guide.
For example:
NodeTimeInfo and many other classes are intentionally immutable.
If you want to change the name of a node (DirectoryNode, FileNode), you must create a new instance, because mutating the internal node state is not allowed.
When state is immutable, it can be safely shared between threads.

A rename operation may look like the following code (pay attention to the lines where new instances of FileNode/DirectoryNode with the new name are created):

/// <summary>
/// Method renames given <paramref name="node" />.
/// Method returns new node that has <see cref="newName"/>.
/// Our rename operation is relatively expensive.
/// Rename is the composition of the following (non-transactional) basic operations.
/// 1) Create a container/blob with a new name.
/// 2) Copy blob content/blobs and its contents to blob/container with new name.
/// 3) Delete original container/blob.
/// </summary>
protected override NodeBase Rename(NodeBase node, string newName)
{
  //Check that the node is valid.
  ThrowIfInvalidPath(node.Path);

  //If we are on the directory/container level.
  if (node.IsDirectory)
  {
    //Check that the directory is valid.
    ThrowIfNotDirectoryOnContainerLevel(node.Path);
    //Create BlobContainerClient for the old container.
    var oldContainer = GetContainerClientForNode(node);
    //Create BlobContainerClient for the new container.
    var newContainer = GetContainerClientForDirectoryName(newName);
    //Create new container
    newContainer.Create();
    //Copy all blobs including blob's content to the new container.
    CopyBlobs(oldContainer, newContainer);
    //Delete the old container.
    oldContainer.Delete();

    //Return representation of the renamed directory.
    return new DirectoryNode(newName, node.Parent);
  }

  //Check that the file is valid.
  ThrowIfNotFileOnBlobLevel(node.Path);
  //Create BlobClient for the old blob.
  var oldBlobNode = GetBlobClientForNode(node);
  //Create BlobClient for the new blob.
  var newBlobNode = GetBlobClientForNodePath(node.Path.ParentPath.AddPathPart(newName));
  //Copy the old blob content to the new (renamed) blob;
  CopyBlob(oldBlobNode, newBlobNode);
  //Delete the old blob.
  oldBlobNode.Delete();
  //Return representation of the renamed file.
  return new FileNode(newName, node.Parent);
}

And one important remark.
1) The virtual file system never checks whether a file is already open or in use, and it does not provide 'exclusive' access to the UNDERLYING FILE SYSTEM entry ('file' - Azure blob, 'directory' - Azure container). If two or more SFTP requests access the same node (for example, two requests trying to upload to the same underlying 'file' (Azure blob) at the same time), you are entering the realm of undefined behavior and all bets are off.
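If your underlying storage layer needs to at least detect such conflicting writers, Azure Blob Storage supports optimistic concurrency via ETags. A hedged sketch using `Azure.Storage.Blobs` types (this is not part of our library, just an illustration of the Azure SDK mechanism):

```csharp
using System;
using System.IO;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

public static class BlobConcurrency
{
    // Upload only if the blob has not changed since we read its ETag.
    // A concurrent writer then causes a 412 (Precondition Failed)
    // instead of a silent lost update.
    public static void UploadIfUnchanged(BlobClient blob, Stream content)
    {
        ETag etag = blob.GetProperties().Value.ETag;
        try
        {
            blob.Upload(content, new BlobUploadOptions
            {
                Conditions = new BlobRequestConditions { IfMatch = etag }
            });
        }
        catch (RequestFailedException ex) when (ex.Status == 412)
        {
            // Someone else modified the blob between our read and our write.
            throw new InvalidOperationException("Concurrent modification detected.", ex);
        }
    }
}
```

This only turns undefined behavior into a detectable failure; callers still have to decide how to retry or merge.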
Hope this helps.

...