How to handle NA in file/path in map and process?

Hi there,
I have a multi-column CSV file like this:

MM-0245,T-01,/path/foo/DNA-N-01-01_joined_R1.fastq.gz,/path/foo/DNA-N-01-01_joined_R2.fastq.gz,DNA-T-01-01_L003_R1_001.fastq.gz,/path/foo/DNA-T-01-01_L003_R2_001.fastq.gz,/path/foo/RNA-T-01-01_L003_R1_001.fastq.gz,/path/foo/RNA-T-01-01_L003_R2_001.fastq.gz
MM-0245,T-02,/path/foo/DNA-N-01-01_joined_R1.fastq.gz,/path/foo/DNA-N-01-01_joined_R2.fastq.gz,DNA-T-02-01_joined_R1.fastq.gz,/path/foo/DNA-T-02-01_joined_R2.fastq.gz,/path/foo/RNA-T-02-01_L003_R1_001.fastq.gz,/path/foo/RNA-T-02-01_L003_R2_001.fastq.gz
MM-0245,T-03,/path/foo/DNA-N-01-01_joined_R1.fastq.gz,/path/foo/DNA-N-01-01_joined_R2.fastq.gz,DNA-T-02-01_joined_R1.fastq.gz,/path/foo/DNA-T-02-01_joined_R2.fastq.gz,NA,NA

It follows this structure: patient,timepoint,normal_WES_R1,normal_WES_R2,tumor_WES_R1,tumor_WES_R2,RNA_R1,RNA_R2

There can be multiple patients in the CSV file, and a patient can have multiple samples/timepoints, as in the example.
However, the full trio (DNA-normal, DNA-tumor and RNA) is not always present. In the example above, MM-0245,T-03 has NA for both the RNA forward and reverse reads.

How do I avoid errors when processing reaches an NA?
My main.nf is:

workflow {

    if (params.analysis=="both"){
        wes()
        rna()
 
    }
    if (params.analysis=="wes"){
        wes()
    }
    
    if (params.analysis=="rna"){
        rna() 
    }
}

My rna.nf is:

include { arriba} from '../modules/rna/arriba.nf'
include { fastp_rna} from '../modules/rna/fastp_rna.nf'

workflow rna {

    def csvFile = params.input_csvFile

    Channel.fromPath( csvFile )
        .splitCsv( )
        .map { row ->
            def patient_info = row[0]
            def sample_info = row[1]
            def normal_reads = tuple((row[2]), (row[3]))
            def tumor_reads = tuple((row[4]), (row[5]))
            def rna_reads = tuple((row[6]), (row[7]))

            return [patient: patient_info, sample: sample_info, normal: normal_reads, tumor: tumor_reads, rna: rna_reads]
        }
        .set { samples }

    fastp_rna(samples)
}

My rna - fastp.nf is:

process fastp_rna {

    conda '/data1/software/miniconda/envs/MMRADAR/'
    maxForks 3
    debug true
    errorStrategy 'retry'
    maxRetries 2
    label 'low_mem'

    publishDir path: "${params.outdir}/${patient_id}/${sample_id}/RNA/fastp/tumor/", mode: 'copy', pattern: '*_T*'

    input:
        tuple val(patient_id), val(sample_id),
            path(normal_reads, stageAs: 'fastp_normal_reads/*'),
            path(tumor_reads, stageAs: 'fastp_tumor_reads/*'),
            path(rna_reads, stageAs: 'rna_reads/*')

    output: 
        tuple val(patient_id_tumor), val(sample_id), path("${patient_id_tumor}_trim_{1,2}.fq.gz"), emit: reads_tumor
        path("${patient_id_tumor}.fastp.json"), emit: json_tumor
        path("${patient_id_tumor}.fastp.html"), emit: html_tumor
        

    script:            
        patient_id_tumor=patient_id+"_T"
        def(r1_tumor,r2_tumor)=rna_reads

    """
    /data1/software/miniconda/envs/MMRADAR/bin/fastp  --in1 "${r1_tumor}" --in2 "${r2_tumor}" \
    -q 20  -u 20 -l 40 --detect_adapter_for_pe --out1 "${patient_id_tumor}_trim_1.fq.gz" \
--out2 "${patient_id_tumor}_trim_2.fq.gz" --json "${patient_id_tumor}.fastp.json" \
--html "${patient_id_tumor}.fastp.html" --thread 20
    """
}

workflow.onComplete { 
        log.info ( workflow.success ? "Done rna fastp!" : "Oops .. something went wrong in rna fastp"  )
}

Where do I put a check so that nothing is processed for the NA RNA reads?

If I understood correctly, you want the fastp_rna process to be skipped when it runs into an NA. The first thing to take care of is how the NA in the CSV file is handled, so that Nextflow doesn’t try to do file('NA'). Then, filter out the channel elements with missing reads so that only valid input reaches the RNA-related process. See the example below:

def createTupleOrString(fileString) {
  if (fileString == "NA") {
      return "NA"
  } else {
      return file(fileString)
  }
}

process RNA_process {
  debug true

  input:
  tuple val(sample_name), path(normal_reads), path(tumor_reads), path(rna_reads), val(time_point)

  output:
  stdout

  script:
  """
  echo ${rna_reads}
  """
}

workflow {
  Channel
    .fromPath(file("input_timestamp.csv"))
    .splitCsv(sep: ',')
    .map { row ->
      // Extract relevant information
      def sample_ID = row[0]
      def normal_Reads = tuple(file(row[1]), file(row[2]))
      def tumor_Reads = tuple(file(row[3]), file(row[4]))
      def rna_Reads = tuple(createTupleOrString(row[5]), createTupleOrString(row[6]))
      def time_Point = row[7]

      // Return a map with the processed information
      return [sample_name: sample_ID, normal: normal_Reads, tumor: tumor_Reads, rna: rna_Reads, time_point: time_Point]
    }
    .set { samples }

  samples
    .filter { it.rna[0] != 'NA' }
    | RNA_process
}

Sorry for the late reply.
Yes, that is correct: fastp_rna needs to be skipped for rows where NA is encountered.

Did my solution work for you?

Hi @mribeirodantas Yes, it worked.

Is there a way to check multiple things using OR/|| in it?

it.rna[0] != 'NA'

Like it.rna[0] != 'NA' || it.rna[1] != 'NA'

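You can use || for “or” in Groovy, but filtering on multiple conditions can also be achieved without it by chaining filters, as you can see in the example below. Note that chained filters keep only the elements that pass every condition, so they behave like && (“and”) rather than ||: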

...
 samples
    .filter { it.rna[0] != 'NA' }
    .filter { it.time_point != 'NA' }
    | RNA_process
...
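
For reference, both conditions can also go into a single filter. Below is a minimal sketch, assuming the goal is to keep only rows where both RNA reads are present; the sample names and file names are made up for illustration. Note that this needs && rather than ||, since || would also keep a row where only one of the two reads is NA.

```groovy
workflow {
    Channel
        .of(
            // Hypothetical rows, already converted to maps as in the example above
            [sample_name: 'A', rna: ['A_R1.fastq.gz', 'A_R2.fastq.gz']],
            [sample_name: 'B', rna: ['B_R1.fastq.gz', 'NA']],
            [sample_name: 'C', rna: ['NA', 'NA']]
        )
        // Keep an element only if neither RNA read is 'NA'
        .filter { it.rna[0] != 'NA' && it.rna[1] != 'NA' }
        .view { "kept: ${it.sample_name}" }
}
```

Only sample A passes this filter; B and C are dropped because at least one of their RNA reads is NA.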

Nice. Thank you

@mribeirodantas I think I need more help.

In the above example, skipping the RNA step when the reads are NA works fine.

However, I’d also like to analyse the tumor and normal reads, for instance for whole-exome sequencing.

In the example above, for MM-0245 I’d like to process the tumor and normal reads. It’s fine not to do the RNA analysis. How do I enable WES analysis for the two tuples, tumor and normal, in this case?

If I use filter, it will skip the tumor and normal reads for those rows as well.

Maybe the best option here is to use the branch channel operator, and have as many channels as you want. Then, you can provide each channel to the desired subworkflow/process.

Check the snippet below:

Channel
    .of([type:'tumor',  rna:'NA'],
        [type:'tumor',  rna:file('/so/me/path')],
        [type:'normal', rna:file('/some/other/path')])
    .branch {
        tumor:  it.type == 'tumor' && it.rna != 'NA'
        normal: it.type == 'normal'
    }
    .set { result }

result.tumor.view { "$it is a tumor" }
// result.normal.view { "$it is not a tumor" }

You could do PROCESS_FOO(result.tumor) for example.

Does it seem to solve your problem?
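
As a rough illustration of how branch behaves on CSV-like rows, here is a small sketch. The rows are hard-coded and shortened to patient, timepoint, RNA_R1 and RNA_R2, and the branch names with_rna and without_rna are made up for this example. Each element is sent to the first branch whose condition matches, so a row ends up in exactly one of the two outputs.

```groovy
workflow {
    Channel
        .of(
            // Shortened, hypothetical rows: patient, timepoint, RNA_R1, RNA_R2
            ['MM-0245', 'T-01', '/path/foo/RNA-T-01-01_L003_R1_001.fastq.gz', '/path/foo/RNA-T-01-01_L003_R2_001.fastq.gz'],
            ['MM-0245', 'T-03', 'NA', 'NA']
        )
        .branch { row ->
            // Rows where both RNA reads are present
            with_rna: row[2] != 'NA' && row[3] != 'NA'
            // Fallback: everything that did not match above
            without_rna: true
        }
        .set { rows }

    rows.with_rna.view { "RNA available: ${it[0]} ${it[1]}" }
    rows.without_rna.view { "RNA missing: ${it[0]} ${it[1]}" }
}
```

Because each row lands in only one branch, branch on its own is a better fit for "either/or" routing than for sending the same row to several analyses.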

@mribeirodantas

Sorry, I do not follow your solution here.
How do I do that?

I have the CSV data above. How do I start from it and branch it out? I don’t understand how to use branch while reading a CSV.

@mribeirodantas

I’m trying the code below, but can’t seem to make it work:

Channel.fromPath(file("temp_timestamp_NA.csv"))  //Channel.fromPath(file("input_timestamp.csv"))
        .splitCsv(sep: ',').map{ row -> 
            // Extract relevant information
            def batch_info = row[0]
            def time_point=row[1]
            def normal_reads = tuple((row[2]),(row[3]))
            def tumor_reads = tuple((row[4]), (row[5]))
            def rna_reads = tuple(createTupleOrString(row[6]), createTupleOrString(row[7]))           
            
                         [
            [type: "tumor", data: tumor_reads,meta: [type: "tumor"]],
            [type: "normal", data: normal_reads, meta: [type: "normal"]],
            [type: "rna", data: rna_reads, meta: [type: "rna"]]
        ]
             
        }.branch { type, reads, meta ->
        tumor: meta.type == "tumor"
        normal: meta.type == "normal"
        rna: meta.type == "rna" && reads[0]!='NA'
    } | set {test}

    test.tumor.view { "$it is a tumor" }

Is this OK?

How do I save batch, timepoint information?

I am unable to view the contents. :frowning:
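
One possible way to keep the patient and timepoint with every set of reads is to build one map per CSV row and then reuse that channel twice: once unfiltered for WES, and once filtered for RNA. This is only a sketch under some assumptions: fileOrNull is a hypothetical helper, wes_input and rna_input are made-up channel names, and the column order is the one described at the top of the thread. It uses filter plus map rather than branch, because branch sends each element to a single output.

```groovy
// Hypothetical helper: turn a CSV cell into a file object, or null when it is 'NA'
def fileOrNull(cell) {
    return cell == 'NA' ? null : file(cell)
}

workflow {
    Channel
        .fromPath(params.input_csvFile)
        .splitCsv()
        .map { row ->
            // One map per row, so patient and timepoint travel with the reads
            [
                patient: row[0],
                sample : row[1],
                normal : [file(row[2]), file(row[3])],
                tumor  : [file(row[4]), file(row[5])],
                rna    : [fileOrNull(row[6]), fileOrNull(row[7])]
            ]
        }
        .set { samples }

    // WES input: every row, regardless of whether RNA is present
    samples
        .map { s -> tuple(s.patient, s.sample, s.normal, s.tumor) }
        .set { wes_input }

    // RNA input: only rows where both RNA reads are present
    samples
        .filter { s -> s.rna[0] != null && s.rna[1] != null }
        .map { s -> tuple(s.patient, s.sample, s.rna) }
        .set { rna_input }

    wes_input.view { "WES: $it" }
    rna_input.view { "RNA: $it" }
}
```

The two resulting channels could then be passed to the WES and RNA processes respectively, with each process declaring a tuple input that matches the shape emitted here.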
