Syntax error in a script that aims to process each FASTA file in ${params.assemblies} separately and completes all associated tasks before moving to the next file

lisalamberte · October 24, 2024, 5:16pm

Hi everyone,

I am writing a script that aims to process each FASTA file in ${params.assemblies} separately and completes all associated tasks before moving to the next file.

This is the current script that I have, which I anticipate should work:

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

println "Assemblies: ${params.assemblies}"
println "Query: ${params.query}"

// Define input channels
assembly_ch = Channel.fromPath(params.assemblies)
                     .ifEmpty { error "Cannot find any assemblies matching: ${params.assemblies}" }

query_ch = Channel.value(params.query)

// Function to run the entire workflow for a given assembly
process runWorkflowForAssembly {
    input:
    path fasta_file
    path query_file

    output:
    path "blast_output/${query_file.baseName}_vs_${fasta_file.baseName}.out", 
         path "blast_output_filtered/${query_file.baseName}_vs_${fasta_file.baseName}_filtered.out",
         path "extracted_sequences/${fasta_file.baseName}_extracted.fasta",
         path "extracted_sequences/${query_file.baseName}_coords_with_strand.txt"

    script:
    """
    # Create a BLAST database for the FASTA file
    db_prefix="blastdb/${fasta_file.baseName}_blastdb"
    mkdir -p ${db_prefix}
    makeblastdb -in ${fasta_file} -dbtype nucl -out ${db_prefix}/${fasta_file.baseName}

    # Perform BLAST search
    blastn -query ${query_file} -db ${db_prefix}/${fasta_file.baseName} -out blast_output/${query_file.baseName}_vs_${fasta_file.baseName}.out -outfmt 6
    awk '\$4 >= 50' blast_output/${query_file.baseName}_vs_${fasta_file.baseName}.out > blast_output_filtered/${query_file.baseName}_vs_${fasta_file.baseName}_filtered.out

    # Extract sequences based on filtered BLAST output
    coords_file="extracted_sequences/${query_file.baseName}_coords_with_strand.txt"
    forward_coords_file="forward_coords_${fasta_file.baseName}.txt"
    reverse_coords_file="reverse_coords_${fasta_file.baseName}.txt"
    temp_reverse_file="temp_reverse_${fasta_file.baseName}.fasta"

    mkdir -p extracted_sequences

    # Extract coordinates and store in a file with forward and reverse strand info
    awk '{ 
        if (\$9 < \$10) {
            start = (\$9-50 < 1) ? 1 : \$9-50;
            print \$2":"start"-"(\$10+50)"\\tforward";
        } else {
            start = (\$10-50 < 1) ? 1 : \$10-50;
            print \$2":"start"-"(\$9+50)"\\treverse";
        }
    }' blast_output_filtered/${query_file.baseName}_vs_${fasta_file.baseName}_filtered.out > \$coords_file

    # Separate into forward and reverse coordinate files
    grep forward \$coords_file | cut -f1 > \$forward_coords_file
    grep reverse \$coords_file | cut -f1 > \$reverse_coords_file

    samtools faidx ${fasta_file}

    # Ensure the output file for extracted sequences is created or appended properly
    touch extracted_sequences/${fasta_file.baseName}_extracted.fasta

    # Check if forward coordinates exist and extract forward sequences
    if [ -s \$forward_coords_file ]; then
        samtools faidx ${fasta_file} -r \$forward_coords_file >> extracted_sequences/${fasta_file.baseName}_extracted.fasta || {
            echo "Error extracting forward sequences from ${fasta_file}" >> extracted_sequences/errors.log
        }
    fi

    # Check if reverse coordinates exist and extract reverse complement sequences
    if [ -s \$reverse_coords_file ]; then
        samtools faidx ${fasta_file} -r \$reverse_coords_file > \$temp_reverse_file || {
            echo "Error extracting reverse sequences from ${fasta_file}" >> extracted_sequences/errors.log
        }

        # Use Biopython to reverse complement the sequences and append them to the same file
        python3 -c "
import sys
from Bio import SeqIO

with open('\$temp_reverse_file') as fasta_in, open('extracted_sequences/${fasta_file.baseName}_extracted.fasta', 'a') as fasta_out:
    for record in SeqIO.parse(fasta_in, 'fasta'):
        record.seq = record.seq.reverse_complement()
        SeqIO.write(record, fasta_out, 'fasta')
" || {
            echo "Error in Biopython reverse complement" >> extracted_sequences/errors.log
        }

        rm \$temp_reverse_file
    fi
    """
}

// Workflow definition
workflow {
    assembly_ch
        .combine(query_ch) // Combine each assembly with the query
        .set { pairs -> 
            pairs.map { fasta_file, query_file -> 
                runWorkflowForAssembly(fasta_file, query_file)
            }
        }
}

But I keep on getting what seems to be a syntax error:

N E X T F L O W ~ version 24.04.2

Launching test.nf [dreamy_tesla] DSL2 - revision: 75d4027402

ERROR ~ Script compilation error

file : /rds/projects/m/mcdonamc-sept-protect-wp4/avrstb6_alignment/test.nf
cause: Unexpected input: ‘{’ @ line 15, column 32.
process runWorkflowForAssembly {
^

1 error

NOTE: If this is the beginning of a process or workflow, there may be a syntax error in the body, such as a missing or extra comma, for which a more specific error message could not be produced.

– Check ‘.nextflow.log’ file for details

What am I missing here?

Thanks very much in advance!

mahesh.binzerpanchal · October 25, 2024, 8:37am

You’ve got comma’s at the end of your path’s in the output: section.
It’s not clear whether you want to output a tuple, i.e. all four files as a single list in a channel, or whether each file should be put in it’s own channel ( in which case remove the commas at the end of the line ).

So you should either have:

    tuple( 
        path ("blast_output/${query_file.baseName}_vs_${fasta_file.baseName}.out"), 
        path ("blast_output_filtered/${query_file.baseName}_vs_${fasta_file.baseName}_filtered.out"),
        path ("extracted_sequences/${fasta_file.baseName}_extracted.fasta"),
        path ("extracted_sequences/${query_file.baseName}_coords_with_strand.txt")
    )

or

    path ("blast_output/${query_file.baseName}_vs_${fasta_file.baseName}.out")
    path ("blast_output_filtered/${query_file.baseName}_vs_${fasta_file.baseName}_filtered.out")
    path ("extracted_sequences/${fasta_file.baseName}_extracted.fasta")
    path ("extracted_sequences/${query_file.baseName}_coords_with_strand.txt")

Alexander_Nater · October 25, 2024, 9:53am

Your workflow definition looks very strange. You are calling the process within a map closure within a set closure?

Topic		Replies	Views
Multi-channel output cannot be applied to operator combine Ask for help nextflow	3	174	October 24, 2024
Compilation error when using ${file.join(' ')} in for loop Ask for help	1	11	November 13, 2024
Fixing error: FastQ file for reads 1 must be provided, cannot contain spaces and must have extension Ask for help nextflow , nf-core	5	131	February 27, 2025
How to wait until all files are processed at a task for next step? Ask for help	6	505	February 12, 2024
Fastq columns in nf-core pipeline even though they are not part of the samplesheet Ask for help	2	36	February 6, 2025

Syntax error in a script that aims to process each FASTA file in ${params.assemblies} separately and completes all associated tasks before moving to the next file

Related topics