Duplicate file names

ramirobarrantes · November 6, 2024, 6:01pm

I have a pipeline where I map the same reads against two different indices, producing two BAM files that have the same name but live in different directories. It looks something like this:


bam_index1 = COPTR_MAP1(ch_reads,[[id:'index1'],file(params.index1)])
bam_index2 = COPTR_MAP2(ch_reads,[[id:'index2'],file(params.index2)])

bam_index1.bam.join(bam_index2.bam)
    .groupTuple(
         by: [0]
    )
    .map { result ->
        [result[0],[result[1][0],result[2][0]]]
    }
    .set{ch_merged}

The resulting channel looks something like this:

[[id:ERR10889327, single_end:false], [/gpfs1/home/r/b/rbarrant/projects/coptrPipeline/deleteme_work/e1/267a274ba0b7b92d2506879a156de5/ERR10889327.bam, /gpfs1/home/r/b/rbarrant/projects/coptrPipeline/deleteme_work/0c/ae859977b9990136da664363126406/ERR10889327.bam]]
[[id:ERR10889525, single_end:false], [/gpfs1/home/r/b/rbarrant/projects/coptrPipeline/deleteme_work/de/57db29853127258e6afcad4c3acf5e/ERR10889525.bam, /gpfs1/home/r/b/rbarrant/projects/coptrPipeline/deleteme_work/62/1652359efc2ff4578a3bdbc060c040/ERR10889525.bam]]

The next process, COPTR_MERGE, which merges the BAM files, fails when the input file names are the same:

ERROR ~ Error executing process > 'COPTR_MERGE (2)'
Caused by:
Process COPTR_MERGE input file name collision -- There are multiple input files for each of the following file names: ERR10889525.bam

What is the best-practice way to solve this? Should I change the file names in the previous process (COPTR_MAP) or inside COPTR_MERGE? Are there any examples of renaming input files when their names collide?

I can think of a few options, but I don’t want my second nf-core module to look too ugly! My current thought is to iterate over all the BAMs inside the COPTR_MERGE module and add a number to each file name.

mribeirodantas · November 6, 2024, 9:08pm

You can solve this in many different ways, including in the next process where the name collision is happening. However, as this is a module and people will reuse it, I’d rather have it cleaned up in the module itself.

Based on that, I recommend not giving files the same name, even if they’re in different directories, when they’re going to be used together by another process.
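One way to get unique names is sketched below. It assumes that COPTR_MAP1 and COPTR_MAP2 follow the usual nf-core convention of naming their outputs from task.ext.prefix, which the thread does not show, and that the sample meta map is available as `meta` in each process. The config file location and the `.index1`/`.index2` suffixes are only illustrative.

```groovy
// e.g. conf/modules.config (hypothetical location)
// Assumes each process builds its output name from task.ext.prefix.
process {
    withName: 'COPTR_MAP1' {
        // yields e.g. ERR10889525.index1.bam instead of ERR10889525.bam
        ext.prefix = { "${meta.id}.index1" }
    }
    withName: 'COPTR_MAP2' {
        // yields e.g. ERR10889525.index2.bam
        ext.prefix = { "${meta.id}.index2" }
    }
}
```

With distinct prefixes, the two BAMs for a sample no longer share a file name when they arrive at COPTR_MERGE.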

mahesh.binzerpanchal · November 7, 2024, 12:44pm

Ideally, modify COPTR_MERGE to stage each file in its own folder, as there’s no guarantee what the inputs will look like. (Ideally the workflow developer will set prefix so the BAMs are uniquely named, but one cannot rely on that.) You do this with the stageAs: option of the path input qualifier.

See modules/nf-core/biobambam/bamsormadup/main.nf in the nf-core/modules repository on GitHub (commit 033f2f25fa14ea81a4b93502d1dc6c2caf21cc92) for an example.
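As a rough sketch of what that could look like in COPTR_MERGE: the input name `bams` and the placeholder script are assumptions, since the real module body is not shown in this thread.

```groovy
process COPTR_MERGE {
    input:
    // "?/*" stages every input file in its own numbered subdirectory
    // (1/, 2/, ...) and keeps the original file name, so two files
    // both called ERR10889525.bam no longer collide.
    tuple val(meta), path(bams, stageAs: "?/*")

    script:
    // Placeholder command: ${bams} expands to the staged relative paths.
    // Replace echo with the module's real merge invocation.
    """
    echo ${bams}
    """
}
```

The merge command then receives paths that differ by directory rather than by name. A name pattern such as stageAs: "input*.bam" should instead rename the staged files with a running number, which is close to the numbering idea in the original question, but the original sample names are then lost inside the task.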

