Nextflow 16S HiFi workflow on an SGE cluster

Hi All

I am having a bit of trouble analyzing my PacBio Kinnex full-length 16S rRNA data. The data I received from the sequencing center were already demultiplexed, with primers removed. I tried pivoting to the PacBio HiFi workflow (pb-16S-nf), which uses Nextflow. The benefit is that you can skip the cutadapt step, and the pipeline is recommended for this type of data. However, running it on the cluster has been problematic. The cluster uses SGE, and my job script might be the cause, as this is my first time working on a cluster. Here is the script in question, after a month of troubleshooting (I also attached the nextflow.config file):

```bash
#!/bin/bash
#$ -N HiFi16SJob
#$ -cwd
#$ -pe smp 64
#$ -l h_vmem=128G
#$ -q bigmem
#$ -j y

# Initialize conda in the current shell environment
__conda_setup="$('/home/ICE/jbeer/anaconda3/bin/conda' 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "/home/ICE/jbeer/anaconda3/etc/profile.d/conda.sh" ]; then
        . "/home/ICE/jbeer/anaconda3/etc/profile.d/conda.sh"
    else
        export PATH="/home/ICE/jbeer/anaconda3/bin:$PATH"
    fi
fi
unset __conda_setup

# Activate the conda environment
conda activate nextflow

# Change to the directory containing the Nextflow pipeline
cd /home/ICE/jbeer/pb-16S-nf

# Run Nextflow with the main.nf script, specifying the input data,
# metadata, and the skip_primer_trim parameter
nextflow run main.nf \
    --input /home/ICE/jbeer/pb-16S-nf/test_data/testing.tsv \
    --metadata /home/ICE/jbeer/pb-16S-nf/test_data/test_metadata.tsv \
    --skip_primer_trim true \
    --VSEARCH_threads 30 \
    --DADA2_threads 30 \
    --cutadapt_threads 4 \
    -profile conda
```
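For completeness, I submit this script with a plain qsub call (the filename here is just illustrative, it is whatever I saved the script as):

```bash
# Submit the job script to SGE (filename is illustrative)
qsub hifi16s_job.sh
```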

Just to provide some more details on issues I have encountered:

```
Failed to submit process to grid scheduler for execution
Command executed:
qsub -terse .command.run
Command exit status:
1
Command output:
Unable to run job: "job" denied: use parallel environments instead of requesting slots explicitly
Exiting.

Work dir:
/home/ICE/jbeer/pb-16S-nf/work/8d/a94282f9449ce56b8e96a0532739b2

Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run

-- Check '.nextflow.log' file for details

====== begin epilog ======
Job finished at 2024-08-27_11:46:07.236063476 (UTC+02)
Exit status = 1
====== end epilog ======
```
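Reading that message, the scheduler seems to reject explicit slot requests and wants them routed through a parallel environment, which in Nextflow maps to the `penv` process setting, e.g.:

```groovy
// nextflow.config: route the slot request through the 'smp'
// parallel environment instead of requesting slots directly
process.penv = 'smp'
```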

After adding that line (process.penv = 'smp') to the nextflow.config file, I got a different error:

```
Command wrapper:
====== begin prolog ======
Job name = nf-pb16S_QC_fastq_(1), Job-ID = 57199, owner = jbeer
Workdir = /home/ICE/jbeer/pb-16S-nf/work/f6/aab9e7089e31df3261805845c57d0f
PE = smp, slots = 64, queue = bigmem
Running on clu-blade14.ice.mpg.de, started at 2024-09-03_13:42:29.519821037 (UTC+02)
====== end prolog ======

/opt/sge/default/spool/clu-blade14/job_scripts/57199: 21: /opt/sge/default/spool/clu-blade14/job_scripts/57199: [[: not found
/opt/sge/default/spool/clu-blade14/job_scripts/57199: 30: /opt/sge/default/spool/clu-blade14/job_scripts/57199: Syntax error: redirection unexpected

====== begin epilog ======
Job finished at 2024-09-03_13:42:29.713545426 (UTC+02)
Exit status = 2
====== end epilog ======

Work dir:
/home/ICE/jbeer/pb-16S-nf/work/f6/aab9e7089e31df3261805845c57d0f

Tip: you can replicate the issue by changing to the process work dir and entering the command bash .command.run

-- Check '.nextflow.log' file for details

====== begin epilog ======
Job finished at 2024-09-03_13:49:35.383213442 (UTC+02)
Exit status = 1
====== end epilog ======
```

Any input on what I am doing wrong with the job script or the cluster setup would be greatly appreciated! I have worked with 16S amplicon data before, but I am fairly inexperienced with PacBio/long-read amplicon data analysis.

Kind regards,
Johann

nextflow.config (1.5 KB)

Hi Johann!

Welcome to the community! :tada:

It looks like you’re having difficulty setting up the Nextflow runner properly. One suggestion that might be helpful is to run nextflow-io/hello as your first pipeline instead of the PacBio pipeline. That will help you troubleshoot any problems with Nextflow itself without worrying about a complicated pipeline or dataset.
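Something like this is enough for a first smoke test:

```bash
# Run the canonical hello-world pipeline to verify the Nextflow setup
nextflow run nextflow-io/hello
```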

The piece that appears to be missing from your configuration is that you need to tell Nextflow to create child jobs using SGE-style qsub commands. You do that by selecting the SGE executor.

That should be as simple as adding the following line to your nextflow.config file:

```groovy
process.executor = 'sge'
```
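Combined with the penv setting you already added, a minimal SGE block in nextflow.config might look like this (values taken from your job script, so adjust for your site):

```groovy
// Sketch of an SGE setup in nextflow.config -- queue and PE names
// come from the job script above; adjust for your cluster
process {
    executor = 'sge'      // submit each task via qsub
    penv     = 'smp'      // parallel environment for multi-slot tasks
    queue    = 'bigmem'   // queue from the original job script
}
```

If you still see sh-style syntax errors such as `[[: not found` after that, it may be worth checking whether your queues launch jobs with plain sh rather than bash; forcing bash with `process.clusterOptions = '-S /bin/bash'` is a common SGE workaround (that is an assumption about your site's configuration, not something I can confirm from your logs).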

Hi Ken,

Thank you so much for your advice. I was able to run the nextflow-io/hello pipeline. I then added the line you recommended to the nextflow.config file, and the process is currently running on the cluster. I will let you know whether it works.

Thanks and have a great weekend!

Hi @Johanndb!

Did everything work well with your cluster execution? It’s always exciting to hear about people’s successes.

Hi Ken,

Thanks for checking in! I have given up on the Nextflow cluster integration. I got to a point in troubleshooting where the error and output files were empty, but Nextflow still would not run. So I decided to shift to R, which was fairly easy to integrate with the cluster. It would be great to use Nextflow, since in theory it runs the whole workflow with a single command, but unfortunately it just did not work for me.

I really hope I can use it one day, but for now I can’t spend much more time on troubleshooting.