Missing RDS Missing output file(s)

Hello fellow community members,

I’ve been encountering an error while attempting to run a Seurat R script within a Nextflow pipeline, and despite trying various approaches, I haven’t been able to resolve it. The error message states ‘Missing output file(s) seurat_analysis.rds expected by process runSeurat (1)’. However, I’ve manually verified that the RDS file is indeed generated in the defined outdir and is functioning correctly.

Here’s an overview of my scripts and the approaches I’ve tried:

  1. Calling R script from Nextflow:
#!/usr/bin/env nextflow
nextflow.enable.dsl=2

params.h5_file = "/res.h5"
params.outdir = "/results"
params.r_script = "/run_seurat_doubletfinder.R"

// Define the input channel
Channel.fromPath(params.h5_file).map { file ->
    def sample_name = file.getParent().getName()
    return tuple(sample_name, file)
}.set { h5_ch }

// Define the process
process runSeurat {
    publishDir "${params.outdir}", mode: 'copy', overwrite: true

    input:
    tuple val(sample_name), path(h5_file)

    output:
    tuple val(sample_name), path("seurat_analysis_${sample_name}.rds"), emit:rds_file

    script:
    """
    source /home/anaconda/etc/profile.d/conda.sh
    conda activate ir413

    Rscript ${params.r_script} ${h5_file} ${params.outdir}/seurat_analysis_${sample_name}.rds
    """
}

// Define workflow
workflow {
    h5_ch | runSeurat
}

My R script:

# Load necessary libraries

args <- commandArgs(trailingOnly = TRUE)
input_path <- args[1]
output_path <- args[2]

# Read the CellBender output data
data.file <- Read_CellBender_h5_Mat(file_name = input_path)

object <- CreateSeuratObject(counts = data.file, project = "seurat_project", min.cells = 3, min.features = 200)

# Some R codes...

# Save the Seurat object
saveRDS(object, file = output_path)

  2. Including code inside the pipeline:
#!/usr/bin/env nextflow
nextflow.enable.dsl=2

params.h5_file = "path/to/input_file.h5"
params.outdir = "path/to/output_directory"

// Define the input channel
Channel.fromPath(params.h5_file).map { file ->
    def sample_name = file.getParent().getName()
    return tuple(sample_name, file)
}.set { h5_ch }

// Define the process
process runSeurat {
    publishDir "${params.outdir}", mode: 'copy', overwrite: true

    input:
    tuple val(sample_name), path(h5_file)

    output:
    tuple val(sample_name), path("seurat_analysis_${sample_name}.rds"), emit:rds_file

    script:
    """
    source /path/to/anaconda3/etc/profile.d/conda.sh
    conda activate ir413

    Rscript -e \"

    # Load necessary libraries

    process_sample <- function(sample_name, h5_file, outdir) {
        cat('Processing sample:', sample_name, '\\n')

        # Read CellBender output
        data.file <- Read_CellBender_h5_Mat(file_name = h5_file)

        # Create Seurat object
        obj <- CreateSeuratObject(counts = data.file, project = sample_name, min.cells = 3, min.features = 200)

        # More processing codes...

        # Save the Seurat object
        rds_file_path <- file.path(outdir, paste0('seurat_analysis_', sample_name, '.rds'))
        saveRDS(obj, rds_file_path)

        if (file.exists(rds_file_path)) {
            cat('RDS file successfully saved at:', rds_file_path, '\\n')
        } else {
            cat('Failed to save RDS file at:', rds_file_path, '\\n')
        }
    }

    \"
    """
}

// Workflow definition
workflow {
    h5_ch | runSeurat
}

I do get the message ‘RDS file successfully saved at:’ followed by the expected location.

I would really appreciate your help. I am unable to spot the cause of the error.

Thanks,
Sonal

Welcome to the community forum, @Sonal_Dahale :slight_smile:

In your output block, you’re telling Nextflow to expect an RDS file at the root of the task directory, but what seems to be happening is that you’re saving it somewhere else: your script writes to ${params.outdir}/seurat_analysis_${sample_name}.rds. If the file is saved in a subdirectory within the task directory, you must include that subdirectory in the path(...) declaration of the output block.

If you’re saving the file outside the task directory, in some location on the computer that is meaningful to you, you must not do it this way. Instead, write the file into the task directory and use the publishDir process directive to get a copy of the result in the location you want, without breaking the pipeline by storing task outputs outside the task work directory (the Nextflow documentation has more on publishDir).
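
To make this concrete, here is a minimal sketch of the process from the question with only the output argument changed. It assumes the R script saves its result to whatever path it receives as its second argument, as the posted script does. The file is written with a relative path, so it lands in the task directory where the output block looks for it, and publishDir then copies it to params.outdir.

```groovy
// Sketch only: same process as in the question, except that the RDS file
// is written into the task work directory instead of params.outdir.
process runSeurat {
    // publishDir copies the declared outputs to params.outdir once the task succeeds
    publishDir "${params.outdir}", mode: 'copy', overwrite: true

    input:
    tuple val(sample_name), path(h5_file)

    output:
    tuple val(sample_name), path("seurat_analysis_${sample_name}.rds"), emit: rds_file

    script:
    """
    source /home/anaconda/etc/profile.d/conda.sh
    conda activate ir413

    # Relative output path: the file is created in the task directory,
    # which is where the output block expects to find it.
    Rscript ${params.r_script} ${h5_file} seurat_analysis_${sample_name}.rds
    """
}
```

With this change the file name produced by the script and the name declared in the output block match, so the "Missing output file(s)" check should pass.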

A few more tips and best practices for your pipeline:

  1. Ideally, you shouldn’t source files or activate conda environments in the script block. For sourcing files and similar tasks, there is a process directive called beforeScript. It is described in the Nextflow documentation, and the example there specifically uses the source command :smiley:
  2. For handling conda environments, you should use the conda process directive, also described in the Nextflow documentation. You can use it both to install packages and to activate conda environments that already exist.
  3. Instead of calling Rscript, I would rather put the R script in the bin folder of the pipeline and start it with a shebang (something like #!/usr/bin/env Rscript). If you want to keep the R code separate from the pipeline script file but still have Nextflow fill in its variables, you can also use the template keyword so that the script is easier to read. The Nextflow documentation has more on template.
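
The tips above could be combined roughly as follows. This is a sketch under several assumptions, not a tested configuration: the environment path /home/anaconda/envs/ir413 is a guess at where the ir413 environment lives, the R script is assumed to have been copied to bin/run_seurat_doubletfinder.R next to the pipeline script, made executable, and given the #!/usr/bin/env Rscript shebang, and recent Nextflow versions are assumed to have conda.enabled = true set in nextflow.config.

```groovy
process runSeurat {
    // Assumption: path to the existing ir413 environment; adjust to your installation.
    // Replaces the manual "source conda.sh" and "conda activate" lines.
    conda '/home/anaconda/envs/ir413'

    // beforeScript is still available for other setup files, for example
    // (hypothetical file name):
    // beforeScript 'source /path/to/setup_env.sh'

    publishDir "${params.outdir}", mode: 'copy', overwrite: true

    input:
    tuple val(sample_name), path(h5_file)

    output:
    tuple val(sample_name), path("seurat_analysis_${sample_name}.rds"), emit: rds_file

    script:
    """
    # Scripts in the pipeline's bin/ folder are added to PATH, so the R script
    # can be called by name; the output is written to the task directory.
    run_seurat_doubletfinder.R ${h5_file} seurat_analysis_${sample_name}.rds
    """
}
```

With this layout, params.r_script is no longer needed, and the process no longer depends on where conda happens to be installed on a given machine beyond the single directive line.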

You may ask why use these Nextflow features and best practices if you can do it manually and it still works. By using Nextflow process directives, you benefit from abstractions that contribute to the scalability, portability, and reproducibility of your pipeline.

Hello Marcel,

Thanks for your prompt reply and detailed explanations. I will edit my script accordingly and will let you know.

Kind regards,
Sonal
