I am using the collectFile() operator to produce a single output file from several “split” input files. I’d like this output filename to use the basename of the input files, and to be sorted based on a tuple key.
Example input file, named test.txt:
hi
hello
goodbye
farewell
The (non-working) code below uses splitText() to split test.txt into two files, each containing two lines (test.1.txt and test.2.txt). The process convert_to_csv then converts each of these files to CSV (test.1.csv, test.2.csv). I’d like to use collectFile() to sort by the tuple key, combine the contents of both CSV files, and write the result to a file called test.csv, with that name derived dynamically from the basename of the input files.
In your simple example the files being fed in already have a newline character so the first solution is probably more appropriate and cleaner. If the input files in your real-world problem don’t include the newline character, you may prefer to use the second solution.
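For readers who just want the general shape, here is a minimal sketch of naming the collected file after the inputs' shared basename. It is not necessarily either of the solutions referred to above. The `csv/test.*.csv` glob is a hypothetical stand-in for the convert_to_csv output, and the sketch assumes the chunk files are named like test.1.csv, so that `simpleName` (the file name with every extension stripped) gives `test`.

```nextflow
workflow {
    // Hypothetical input: CSV chunks such as csv/test.1.csv and csv/test.2.csv
    channel.fromPath('csv/test.*.csv')
        // The closure returns [target file name, content]; entries that map to
        // the same name are appended to the same output file.
        // newLine: false assumes each chunk already ends with a newline;
        // switch to newLine: true if the chunks lack a trailing newline.
        .collectFile(newLine: false) { csv -> ["${csv.simpleName}.csv", csv] }
        .view { merged -> merged.name }
}
```

Note that this sketch does not control the order in which the chunks are appended.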
I would also like to shamelessly plug my nf-boost plugin which provides a “mergeText” function, essentially collectFile as a regular function instead of an operator:
You can use groupTuple and regular list sorting with mergeText to do what you would otherwise do with collectFile, but it’s more flexible and, I would argue, easier to read and understand.
I hope to add mergeText to core Nextflow, but for now you can use the plugin if you’d like.
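The "groupTuple plus regular list sorting" idea can also be sketched without the plugin, by sorting each group with plain Groovy before handing the merged text to collectFile. This is only an illustration under assumptions: the `csv/*.csv` glob stands in for the convert_to_csv output, and the sort key is taken from the chunk index in names like test.1.csv rather than from whatever key the real tuples carry. It does not use mergeText.

```nextflow
workflow {
    channel.fromPath('csv/*.csv')
        // Build [basename, [key, file]]; the key here is the chunk index (assumption)
        .map { csv -> [csv.simpleName, [csv.name.tokenize('.')[1] as Integer, csv]] }
        // Gather all [key, file] pairs that share a basename
        .groupTuple()
        .collectFile { group ->
            def base  = group[0]
            def pairs = group[1]
            // Sort the pairs by key, then concatenate the file contents in that order
            def merged = pairs
                .toSorted { a, b -> a[0] <=> b[0] }
                .collect { pair -> pair[1].text }
                .join()
            ["${base}.csv", merged]
        }
        .view { merged -> merged.name }
}
```

Because each group is sorted and joined before it reaches collectFile, the order no longer depends on when the upstream tasks finish. The trade-off is that the contents are read into memory, which is fine for small CSV chunks but may not suit very large files.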
Thank you for your help! Your suggestion solved the issue of writing to a properly named file. However, I notice that the sort{ it[0] } closure in collectFile() doesn’t sort by the key in my tuple. Instead, the split csv files are written in the order of their completion by the convert_to_csv process. What am I doing wrong here?
Also, a bit of a tangential follow-up question:
Is there a way to deposit this collected file into an S3 bucket? I’m using publishDir in all of my processes to copy outputs to a user-specified S3 folder, but the output of collectFile() gets deposited in the file system where the head job is running. It would be preferable to have all files written to one output location so the user doesn’t have to collect them from different places.
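One option worth trying for the S3 question is the `storeDir` option of collectFile, which sets the folder where the collected file is written. The sketch below assumes AWS credentials are already configured for Nextflow; the bucket path and the `params.outdir` name are hypothetical placeholders.

```nextflow
// Hypothetical output location; replace with the user-specified S3 folder
params.outdir = 's3://my-bucket/results'

workflow {
    channel.of('hi,hello\n', 'goodbye,farewell\n')
        // name sets the output file name, storeDir the destination folder
        .collectFile(name: 'test.csv', storeDir: params.outdir)
        .view()
}
```

Pointing `storeDir` at the same location used for publishDir should keep all outputs in one place.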