Improve performance of GetFiles with a wildcard on a directory with a large number of files

0 votes
asked Aug 30 by gdinunzio (120 points)

I am noticing that GetList is somewhat slow on directories that contain a large number of files. I am currently getting a count of how many matches there are in the remote directory before I decide whether to do a multiple-file transfer or a single-file transfer inside a for loop.

Is there a way to combine the GetList (with a wildcard) with a Linq expression containing a sort and limit on how many items to return?

Is there any other way to improve performance?

Here is the code:

//Check if remote directory exists and has at least one file
fileCount = client.GetList(remoteSearch).Count;
if (client.DirectoryExists(remoteFolder) && fileCount > 0)
{
     //Check if there are more than 20 files to download
     if (fileCount > 20)
     {
          //download oldest 20 files in folder that match criteria                                                                                                
          var sftpItems = client.GetList(remoteSearch).Cast<SftpItem>()
               .Where(w => w.IsFile)
               .OrderBy(o => o.LastWriteTime)
               .Take(20);

          foreach (SftpItem item in sftpItems)
          {
               try
               {
                     client.Download(item.Path, tempDirectory, TraversalMode.MatchFilesShallow,
                          TransferMethod.Move, ActionOnExistingFiles.OverwriteAll);
               }
                catch (Exception)
                {
                     throw; // rethrow, preserving the original stack trace
                }
          }
     }
     else
     {
          //download all files in folder that match criteria                        
          client.Download(remoteSearch, tempDirectory, TraversalMode.MatchFilesShallow,
                        TransferMethod.Move, ActionOnExistingFiles.OverwriteAll);                        
     }                      
}
Applies to: Rebex SFTP

2 Answers

0 votes
answered Aug 31 by Lukas Matyska (47,270 points)

In SFTP, all filtering is performed on the client side.

Directory listing in SFTP is performed by sending (RFC draft):

  • SSH_FXP_OPENDIR(id, path)
  • SSH_FXP_READDIR(id, handle)

There is no such thing as SSH_FXP_READDIR(id, handle, filter).

Therefore, it is surprising that Download(remoteSearch, TraversalMode.MatchFilesShallow) should be slower than GetList(remoteSearch) + download in a loop, because both cases follow basically the same process.

You can compare logs for both cases. You will see that a single Download() should even be slightly faster than GetList() + download in a loop (when using TraversalMode.MatchFilesShallow and your code above).
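To produce those logs, you can enable communication logging on the Sftp object before running either variant. A minimal sketch, assuming the standard Rebex LogWriter property and FileLogWriter class (host, credentials, and file name below are illustrative):

```csharp
using Rebex;
using Rebex.Net;

var client = new Sftp();
// write a detailed protocol log to a file for later comparison
client.LogWriter = new FileLogWriter("sftp-download.log", LogLevel.Debug);

client.Connect("server.example.com");
client.Login("user", "password");
// ... run GetList() + loop, or a single Download(), then compare the log files
```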

commented Aug 31 by gdinunzio (120 points)
I apologize, but my initial question may have been misleading...

I am getting good performance sorting and filtering the list of files to download using LINQ:

//Get a list of the oldest files to download matching the criteria
//presently I am looking for a max of 20 files

int intMaxDownload = 20;
var filesToDownload = client.GetList(remoteSearch).Cast<SftpItem>()
    .Where(w => w.IsFile)
    .OrderBy(o => o.LastWriteTime)
    .Take(intMaxDownload);

What I am having trouble with is how (if possible) to convert the filtered list to something that can be used with the Sftp.Download method.

Should I convert the item collection to a FileSet? If so, is that possible using LINQ or a foreach loop? Or is there a way to filter, sort, and download in one step?

Thanks
Gianluca
0 votes
answered Aug 31 by Lukas Matyska (47,270 points)
edited Sep 3 by Lukas Matyska

Note: this is a reaction to this comment.

I am sorry, I got stuck on the question title "Improve performance of GetFiles ..."

It is possible to filter files on the fly using the ListItemReceived event or by overriding the FileSet.IsMatch() method. However, neither of them can be used for the 'oldest 20 files' case, because that condition requires iterating through the whole list and filtering at the end of the process (not during it). This is currently not possible.
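For illustration, on-the-fly filtering with the ListItemReceived event could process each entry as it arrives rather than waiting for the complete list. This is only a rough sketch; the event name comes from the text above, but the e.Item property and its shape are assumptions here, so verify against the Rebex API reference:

```csharp
var oldCutoff = DateTime.UtcNow.AddDays(-7); // illustrative age threshold

// inspect each listing entry as it is received from the server
client.ListItemReceived += (sender, e) =>
{
    // e.Item is assumed to expose the received SftpItem
    if (e.Item.IsFile && e.Item.LastWriteTime < oldCutoff)
        Console.WriteLine("candidate: " + e.Item.Name);
};

client.GetList(remoteSearch);
```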

So, my suggestion for your task is:

  • filter items using GetList(remoteSearch)
  • select the oldest 20 files by LINQ
  • create the FileSet for those files
  • download files based on the created FileSet

The code can look like this:

//download oldest 20 files in folder that match criteria                                                                                                
var sftpItems = client.GetList(remoteSearch).Cast<SftpItem>()
        .Where(w => w.IsFile)
        .OrderBy(o => o.LastWriteTime)
        .Take(20);

var set = new FileSet(".");
foreach (SftpItem item in sftpItems)
{
    set.Include(item.Name, TraversalMode.NonRecursive);
}

client.Download(set, tempDirectory, TransferMethod.Move, ActionOnExistingFiles.OverwriteAll);

This solution is not optimal: it has to iterate through the remote directory twice (once for GetList(remoteSearch) and once for Download(set)).

This overhead can be eliminated only by not using Download(set). Instead, you can call client.GetFile() for each item in a loop. However, with the GetFile() approach you cannot use TransferMethod.Move or ActionOnExistingFiles.OverwriteAll, so you would need to handle those yourself.
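A sketch of that single-pass alternative, assuming Sftp.GetFile(remotePath, localPath) and Sftp.DeleteFile(path) (check the Rebex API reference for exact signatures), with the move and overwrite semantics reimplemented by hand:

```csharp
using System.IO;

foreach (SftpItem item in sftpItems)
{
    string localPath = Path.Combine(tempDirectory, item.Name);

    // manual equivalent of ActionOnExistingFiles.OverwriteAll
    if (File.Exists(localPath))
        File.Delete(localPath);

    client.GetFile(item.Path, localPath);

    // manual equivalent of TransferMethod.Move: remove the remote file
    // only after the download succeeded
    client.DeleteFile(item.Path);
}
```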

...