Renaming large files causing possible Fusion error

Our pipeline normally operates on paired FASTQ files with read lengths of 42bp and 26bp in R1 and R2 respectively.

Recently, we’ve used an alternative sequencing provider and now our R1 and R2 are 150bp in length. This greatly increases the size of our FASTQ files and downstream files like BAM files.

For larger samples, I am now consistently getting an error in our Platform-launched pipeline (Wave/Fusion, AWS Batch compute environment) for any process that renames a large file, i.e. uses an mv command. For example:

process initial_feature_count{
  tag "$sample_id"

  input:
  tuple val(sample_id), path(sorted_bam)
  path(gtf)

  output:
  tuple val(sample_id), path("${sample_id}.sortedByCoord.featureCounts.bam"), emit: feature_count_bam

  script:
  """
  # Run feature counts on the sorted STAR bam including strandedness and annotation of multimappers
  featureCounts -a $gtf -o ${sample_id}.star.featureCounts.gene.txt -R BAM $sorted_bam -T 4 -t transcript -g gene_id --fracOverlap 0.5 --extraAttributes gene_name -s 1 -M
  mv ${sample_id}.sortedByCoord.out.bam.featureCounts.bam ${sample_id}.sortedByCoord.featureCounts.bam
  """
}

In these cases, the file being renamed is larger than roughly 5 GB. The tasks are suitably provisioned for RAM and CPU.

An excerpt from the fusion logs looks like this:

{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: 5YGM83ZNH2TGCS9C, HostID: gZhCR02LS3k1Ub663gYRtiu4T9NAjtKAOFBPuSL/Kq8XIab0D1ve+W6szfAVBWAtgS7QHLDxBEk=, api error InvalidArgument: Range specified is not valid for source object of size: 251658240","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.0.chk","bucket":"csg-tower-bucket","time":1720040365224810836,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: 5YGN5H8XFJTKV06B, HostID: LFHRtuYPzQGrcVqRBCpczQY9+zsLrjNAGPDrTKZ15HZt0s29T7mPscS8B6CCzMG4L2x3PA4UkiY=, api error InvalidRequest: The specified copy range is invalid for the source object size","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.F000000.chk","bucket":"csg-tower-bucket","time":1720040365368268383,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: 5YGNZEHS3KCAQKFF, HostID: +uzQR4wDUF/4+VTwKKD+aY7uUFUQm8pwDiKZOhWOrl15M42ix2GSkyut8qVoAVHll4QKWbiVbGs=, api error InvalidRequest: The specified copy range is invalid for the source object size","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.1E000000.chk","bucket":"csg-tower-bucket","time":1720040365512161407,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: 5YGRQQ7JDGKRYEAA, HostID: 9Hl1n6wpWT9D7DWAVt2sdGYEOuz2BdMSK6b8E6januC4eDWAm0JJroZh0OvPtr95+bwLQ6T72V8=, api error InvalidArgument: Range specified is not valid for source object of size: 251658240","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.2D000000.chk","bucket":"csg-tower-bucket","time":1720040365654444895,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: 5YGJP4N6ATBFY5K3, HostID: 4Qd6Kt8VhNyQ0D/T05TaFBglt/VsclmpK6NU/LDLYotJ4cs5lJAfawRcaldx3Ggn0vTuFZesI9E=, api error InvalidRequest: The specified copy range is invalid for the source object size","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.3C000000.chk","bucket":"csg-tower-bucket","time":1720040365796822564,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: 5YGYDMHM4VCBEV7Q, HostID: py8YSd0hbz9X/1AxGaJiCeHEzm0fnYQc9+h/vWjGW0XOMuSi6vfbPfpwdgb8W6xrjU9bIzrCHHk=, api error InvalidRequest: The specified copy range is invalid for the source object size","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.4B000000.chk","bucket":"csg-tower-bucket","time":1720040365950307733,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: 6TMFR6ADJT5STWCC, HostID: Gg8uZvZkauU4xlph4UrRAnJ1Ewfq7ENhSqBaUPfpPPPGv5/zX21MNjEXWGUZFLjbYN74Ow2TqhU=, api error InvalidRequest: The specified copy range is invalid for the source object size","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.5A000000.chk","bucket":"csg-tower-bucket","time":1720040366114178281,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: 6TM6987B5R4N19HQ, HostID: +MGwLqDrYcwaBcNbHjREuc7ccuwbJJvQ1SOc/jB3V0V3bg62D/oTGmC9fusrGmtGIsorCVll7RQ=, api error InvalidRequest: The specified copy range is invalid for the source object size","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.69000000.chk","bucket":"csg-tower-bucket","time":1720040366235555129,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: 6TMBJ6WSSPPC3RR6, HostID: hDkUDgqulU9/JOLLsF+u1NUc8BEko8oVFn+2fzN/LUT7IgURCby8GQdt0pr0KrOVJhOGV1Lp2yo=, api error InvalidArgument: Range specified is not valid for source object of size: 251658240","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.78000000.chk","bucket":"csg-tower-bucket","time":1720040366384261492,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: 6TMCMZEDMPSHE1CE, HostID: bqI1QT2IkDGfY5SwS+C1+ELk1yOtJKMW0pkk7i41VVpv7hhnRKBXfuZ6uDGHOep+KSJYFGgl/wU=, api error InvalidArgument: Range specified is not valid for source object of size: 251658240","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.87000000.chk","bucket":"csg-tower-bucket","time":1720040366552076383,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: 6TM28GTG4ZT5G8DX, HostID: pbUO9UKB43zn7+oUtJ4CpJeDsqAU42VZ6t3pg977J4cD1YjkjDc7/lIHIgR3BtNXpSCY2aEzskU=, api error InvalidArgument: Range specified is not valid for source object of size: 251658240","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.96000000.chk","bucket":"csg-tower-bucket","time":1720040366688332934,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: 6TM1W0JV2P9KW1A1, HostID: QrGv4OsGxHJ3UjLzTq0vWow5kRB8g8kp9fas+QCR5Nhn7NXhVVkUBhrmZmmxRyPffYHCx/EbiKY=, api error InvalidRequest: The specified copy range is invalid for the source object size","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.A5000000.chk","bucket":"csg-tower-bucket","time":1720040366827712574,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: 6TMEJ6KG7J7DGW8K, HostID: WtZfEwCaBEDkpk3b2koXIJXgI3yNyPawyFiFa2HAJJLDc6krmUng20vdj0+sRFBLNUJsedBeHbg=, api error InvalidRequest: The specified copy range is invalid for the source object size","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.B4000000.chk","bucket":"csg-tower-bucket","time":1720040366979928113,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: RP1624QCYZ4C85AB, HostID: roEs8AZaJFo3wWxAuNMVxk8qVKhEwhrxA12g8jvUZpva5yFKl1rz4pXDyfPWK9CW9rc7mSI3shY=, api error InvalidRequest: The specified copy range is invalid for the source object size","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.C3000000.chk","bucket":"csg-tower-bucket","time":1720040367107262578,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: RP17KQWKK97XQVMQ, HostID: TAf5Z2C4EO4s0TxHcOF+emJHgi627Ry5OMl0PVyiPAfcD2G9F0kpqTuFba5XkujWzqP2wposgUA=, api error InvalidRequest: The specified copy range is invalid for the source object size","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.D2000000.chk","bucket":"csg-tower-bucket","time":1720040367237245797,"caller":"utils.go:112","message":"at entries complete"}
{"level":"error","error":"operation error S3: UploadPartCopy, https response error StatusCode: 400, RequestID: RP15CVRR81PEDQ71, HostID: BMO89kHHdvyfs1KxZx5NOkfBDGaJm7vWOCGA3X7mIR5wizdANSvFYT+94HhkZDH+gVfIzqkrvlQ=, api error InvalidRequest: The specified copy range is invalid for the source object size","key":"scratch/16TqtvmXjo26kk/7b/9a79c2a440742c65d8a1a781d66974/Sample129_S129.sortedByCoord.featureCounts.bam.E1000000.chk","bucket":"csg-tower-bucket","time":1720040367502392537,"caller":"utils.go:112","message":"at entries complete"}

In some of the processes, I’ve been able to rewrite them so that the mv is not necessary. But in others, like the one above, the mv is the most sensible approach.

Does anybody know why this is happening and how I can fix it?
Here’s the fusion log from one of the tasks in question.
fusion.txt (664.8 KB)

This seems to be the same issue as the thread "Using mv command in AWS environment with fusion".

And I can confirm that replacing offending mv commands with cp, although not ideal, resolves the issue.
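
Some background that may be relevant to the 5 GB threshold: in S3, a single CopyObject request can copy an object of at most 5 GB. Larger copies have to go through a multipart upload in which each part is created with UploadPartCopy, the operation named in the log excerpt. Each UploadPartCopy request carries a source range of the form bytes=first-last, zero-based and inclusive, and S3 rejects the request if that range does not fit inside the source object. The Python sketch below only illustrates how such ranges are computed for a given source size. It is not Fusion's implementation, and copy_part_ranges is a hypothetical helper name.

```python
def copy_part_ranges(source_size, part_size):
    """Return inclusive "bytes=first-last" ranges covering a source object."""
    if source_size <= 0 or part_size <= 0:
        raise ValueError("source_size and part_size must be positive")
    ranges = []
    start = 0
    while start < source_size:
        # Clamp the last byte so the range never runs past the source object.
        end = min(start + part_size, source_size) - 1
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges
```

For a source object of 251658240 bytes, the only valid ranges end at byte 251658239 or earlier. A request whose range was computed from a different, larger size would fail with the kind of error shown in the logs, although the excerpt alone does not show which range Fusion actually sent.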

You don’t actually need to mv or rename the file here. It’s not adding anything to the process. Instead, just capture the original file, then rename it when you need to use it.

process initial_feature_count{
  tag "$sample_id"

  input:
  tuple val(sample_id), path(sorted_bam)
  path(gtf)

  output:
  tuple val(sample_id), path("${sample_id}.sortedByCoord.out.bam.featureCounts.bam"), emit: feature_count_bam

  script:
  """
  # Run feature counts on the sorted STAR bam including strandedness and annotation of multimappers
  featureCounts -a $gtf -o ${sample_id}.star.featureCounts.gene.txt -R BAM $sorted_bam -T 4 -t transcript -g gene_id --fracOverlap 0.5 --extraAttributes gene_name -s 1 -M
  """
}

You can rename it in two ways. If you use it in a subsequent process, you can just change the input name at runtime:

process process_2 {
  tag "$sample_id"

  input:
  tuple val(sample_id), path("${sample_id}.sortedByCoord.featureCounts.bam")
  path(gtf)

  output:
  tuple val(sample_id), path("${sample_id}_files.txt"), emit: feature_count_bam

  script:
  """
  ls -lh ${sample_id}.sortedByCoord.featureCounts.bam > ${sample_id}_files.txt
  """
}

If you want to publish it, you can rename it with the saveAs option of publishDir:

publishDir "${params.outdir}", pattern: "*.sortedByCoord.out.bam.featureCounts.bam", saveAs: { filename -> "${sample_id}.sortedByCoord.featureCounts.bam" }

Doing this will save you expensive IO operations for renaming a file.

Here’s a miniature example for demonstration purposes:

process ECHO_1 {
    input:
        val sample
    output:
        tuple val(sample), path("${sample}_echo_1.txt"), emit: output

    script:
    """
    touch ${sample}_echo_1.txt
    """
}

process ECHO_2 {
    input:
        tuple val(sample), path("${sample}.txt")

    output:
        tuple val(sample), path("${sample}_echo_2.txt"), emit: output

    """
    cat ${sample}.txt > ${sample}_echo_2.txt
    """
}

workflow {
    input_ch = Channel.of("A", "B", "C")
    ECHO_1(input_ch)
    ECHO_2(ECHO_1.out.output)
    ECHO_2.out.output.view()
}

Brilliant. Thank you!

Hi,

I have the same issue using the pipeline nf-core/cellranger v2.7 while making a custom reference with the cellranger aligner.

It seems that the STAR index is moved at the end by the cellranger executable, and I get the same error in an AWS Batch environment using Fusion/Wave.

Thx,