How to deal with ambiguous outputs of a command-line process?

Hi,

I’m pretty new to Nextflow and nf-core and my ultimate goal is to contribute a simple module for a tool I use a lot that does not exist as a module yet (CellProfiler).

I’m trying to figure out how to deal with the fact that my tool has unpredictable outputs that depend on a user-supplied pipeline file. Depending on this file, many different extensions could be used for image and csv output. The folder structure can also vary widely, and ideally I would want to preserve it. Is the only way to deal with this to include all possible file extensions as separate optional outputs?

Rebecca

Hi Rebecca,

Yes, this is generally what we do in nf-core. It also lets users of the module access specific outputs explicitly and unambiguously for the next step. Here is an example: modules/modules/nf-core/antismash/antismashlite/main.nf at eabe5808d97ccacdd694b9ce90af4bca47ddc54e · nf-core/modules · GitHub. You can see that in some instances outputs are combined where it makes sense.
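To illustrate the "one optional output per extension" idea, here is a minimal, untested sketch of what such a process could look like. The process name, the `output` directory, the input names, and the glob patterns are all assumptions made for illustration, and the command line should be checked against the CellProfiler documentation:

```nextflow
// Hypothetical sketch, not an existing nf-core module.
process CELLPROFILER {
    tag "$meta.id"

    input:
    // Images are staged into an "input" folder (name is an assumption)
    tuple val(meta), path(images, stageAs: "input/*")
    path cppipe

    output:
    // One optional channel per extension, so a missing file type is not an error.
    // "**" is meant to match files at any depth below "output".
    tuple val(meta), path("output/**.csv")       , optional: true, emit: csv
    tuple val(meta), path("output/**.png")       , optional: true, emit: png
    tuple val(meta), path("output/**.{tif,tiff}"), optional: true, emit: tiff
    tuple val(meta), path("output/**.txt")       , optional: true, emit: txt

    script:
    // Assumed headless invocation; verify the flags for your CellProfiler version.
    """
    mkdir -p output
    cellprofiler -c -r -p $cppipe -i input -o output
    """
}
```

A downstream step could then take just `CELLPROFILER.out.csv`, for instance, without needing to know which other file types were produced.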

Do you have a list of all possible outputs? I don’t know the tool well enough to give more detailed advice.

Hope this helps :slight_smile:

Thanks for the reply and for the example!

The outputs in this case would typically be csvs, pngs, tifs, tiffs, SQLite and possibly txt. I understand accounting for the possibility of each extension, and I think this is what I’ve gone with so far (still testing). The tool also allows users to customize the file structure (e.g., /<image_name>/cells.csv). Is there any way to preserve this structure? If not, is there a way to standardize a known file structure given metadata from the samplesheet? For instance, if I want Nextflow to process each image set in parallel but store results grouped by the name of the plate and well the image came from (assuming this is also a column in the samplesheet), is there a way to easily do that?

Thanks again!
Rebecca

Hi Rebecca,

Apologies for the delay.
The organization of the results directory is handled separately from how data from the output directive is passed through channels.

We specify how the output of a process is published in the modules.config. You can publish it all in one location, split it into several locations depending on some condition, or split it up by some meta information, for example.

Here are some examples:

For FastP here we publish the logs in one subdirectory of the results, and the trimmed FastQ files (if enabled) in another.

For Strelka here we publish in different subfolders based on the sample name, etc.

So you can tinker with the results directory and decide which files go where, as needed.
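As a rough sketch of the meta-based publishing described above, a modules.config entry for grouping results by plate and well might look something like this. The process name `CELLPROFILER` and the `meta.plate` and `meta.well` keys are hypothetical and assume the samplesheet columns are carried into the meta map; `params.outdir` and `params.publish_dir_mode` follow the usual nf-core pipeline template:

```nextflow
// conf/modules.config (sketch, untested)
process {
    withName: 'CELLPROFILER' {
        publishDir = [
            // The closure is evaluated per task, so each image set is
            // published under its own plate/well folder.
            // meta.plate and meta.well are assumed samplesheet columns.
            path: { "${params.outdir}/cellprofiler/${meta.plate}/${meta.well}" },
            mode: params.publish_dir_mode,
            // Returning null skips a file; here versions.yml is not published.
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
        ]
    }
}
```

Tasks would still run in parallel per image set; only the location where results are published changes.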