Help with Parallelizing Sample Processing in Nextflow

Dear Nextflow Community,

I’m new to Nextflow and still learning. After going through the training, I tried building my own preprocessing workflow for scRNA-seq data in main.nf. My current workflow looks like this:

nextflow.enable.dsl=2

workflow {

    Channel
        .fromPath("${params.input_dir}/*", type: 'dir')
        .ifEmpty { error "❌ No sample directories found in: ${params.input_dir}" }
        .set { sample_dirs_ch }
    
    // Read metadata
    raw_ch = ReadMetadata(sample_dirs_ch)
}

// Process to read metadata from each sample
process ReadMetadata {
    tag "${input_file.getName()}"

    conda "${params.env_dir}/read_metadata.yaml"

    publishDir "${params.output_dir}", mode: 'copy', overwrite: true

    input:
    path input_file

    output:
    path "${input_file.simpleName}.raw.h5ad"

    script:
    """
    python ${workflow.projectDir}/scripts/read_metadata.py \
        --input ${input_file} \
        --output ${input_file.simpleName}.raw.h5ad
    """
}

This workflow works as expected, but the ReadMetadata process is quite slow. I believe it processes each sample sequentially, although I don't know how to verify whether the tasks inside the process actually run in parallel.

I would like to parallelize this process so that each sample is processed independently, leveraging multiple CPU cores. Could someone guide me on the best way to achieve parallel execution for each sample in a process?

Thank you in advance for your help!

Best regards

Assuming that each sample has its own directory in params.input_dir (it's confusing that you call the input to your ReadMetadata process input_file), the samples will be processed in parallel by Nextflow, provided you have enough resources to do so. Since this appears to be running with the local executor and you don't specify any process resource requirements such as memory or cpus, Nextflow has no idea how many resources each task of the process needs. You might have a default for all processes somewhere in your config, but without properly setting both the available system resources and the per-process requirements, Nextflow cannot effectively parallelize your tasks.
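To answer the side question of how to check this: run with the built-in trace report, e.g. nextflow run main.nf -with-trace, and compare the start and complete timestamps of the ReadMetadata tasks; overlapping intervals mean they ran concurrently. As for the requirements, one option is to declare them directly in the process definition. This is only a sketch, and the values are placeholders you would tune for your own data:

process ReadMetadata {
    cpus 1        // placeholder: cores each task may use
    memory 4.GB   // placeholder: expected per-sample memory footprint
    time 1.h      // placeholder: generous per-task limit

    // ... rest of the process (tag, conda, publishDir, input, output, script) unchanged
}

With one core per task, the local executor can then run as many tasks side by side as it has cores available.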

// nextflow.config
params {
    experiment = "msc"
    input_dir  = "data/${params.experiment}/data/base"
    output_dir = "data/${params.experiment}/data/temp/temp1"
    env_dir = "envs"
}

process {
    executor = 'local'
    memory   = '100 GB'
    time     = '100d'

    // Dynamically detect total CPUs
    cpus = Runtime.runtime.availableProcessors()
}

conda {
    enabled     = true
    useMicromamba = true
}

It was my error that I didn't share my nextflow.config. Here I specified cpus and memory.

OK, you are mixing up two things here: available resources for the executor vs. resource requirements for process tasks. At the moment you are telling Nextflow that every task needs 100 GB of memory and all available CPUs, so only one task can run at a time. Process requirements belong under the process scope, available resources under the executor scope; see Configuration options in the Nextflow documentation.

Your config should therefore look like this:

process {
    executor = 'local'
    cpus     = 1
    memory   = 4.GB
    time     = 1.h
}

executor {
    name   = 'local'
    cpus   = Runtime.runtime.availableProcessors()
    memory = 100.GB
}
Oh, that’s embarrassing :sweat_smile:. Thank you so much, this is exactly the solution I needed!