How to handle NA in file/path in map and process?

complexgenome · January 18, 2024, 7:29pm

Hi there,
I have a multi-column CSV file like this:

MM-0245,T-01,/path/foo/DNA-N-01-01_joined_R1.fastq.gz,/path/foo/DNA-N-01-01_joined_R2.fastq.gz,DNA-T-01-01_L003_R1_001.fastq.gz,/path/foo/DNA-T-01-01_L003_R2_001.fastq.gz,/path/foo/RNA-T-01-01_L003_R1_001.fastq.gz,/path/foo/RNA-T-01-01_L003_R2_001.fastq.gz
MM-0245,T-02,/path/foo/DNA-N-01-01_joined_R1.fastq.gz,/path/foo/DNA-N-01-01_joined_R2.fastq.gz,DNA-T-02-01_joined_R1.fastq.gz,/path/foo/DNA-T-02-01_joined_R2.fastq.gz,/path/foo/RNA-T-02-01_L003_R1_001.fastq.gz,/path/foo/RNA-T-02-01_L003_R2_001.fastq.gz
MM-0245,T-03,/path/foo/DNA-N-01-01_joined_R1.fastq.gz,/path/foo/DNA-N-01-01_joined_R2.fastq.gz,DNA-T-02-01_joined_R1.fastq.gz,/path/foo/DNA-T-02-01_joined_R2.fastq.gz,NA,NA

It follows this structure: patient,timepoint,normal_WES_R1,normal_WES_R2,tumor_WES_R1,tumor_WES_R2,RNA_R1,RNA_R2

There can be multiple patients in the CSV file, and a patient can have multiple samples/timepoints, as in the example.
However, the full trio (DNA-normal, DNA-tumor and RNA) is not always present. In the example above, MM-0245,T-03 has NA for both the RNA forward and reverse reads.

How do I avoid errors when processing reaches an NA?
My main.nf is:

workflow {

    if (params.analysis=="both"){
        wes()
        rna()
 
    }
    if (params.analysis=="wes"){
        wes()
    }
    
    if (params.analysis=="rna"){
        rna() 
    }
}

My rna.nf is:

include { arriba} from '../modules/rna/arriba.nf'
include { fastp_rna} from '../modules/rna/fastp_rna.nf'

workflow rna {

    def csvFile = params.input_csvFile

    Channel.fromPath( csvFile )
        .splitCsv( )
        .map { row ->
            def patient_info = row[0]
            def sample_info = row[1]
            def normal_reads = tuple((row[2]), (row[3]))
            def tumor_reads = tuple((row[4]), (row[5]))
            def rna_reads = tuple((row[6]), (row[7]))

            return [patient: patient_info, sample: sample_info, normal: normal_reads, tumor: tumor_reads, rna: rna_reads]
        }
        .set { samples }

    fastp_rna(samples)
}

My RNA fastp module (fastp_rna.nf) is:

process fastp_rna {

    conda '/data1/software/miniconda/envs/MMRADAR/'
    maxForks 3
    debug true
    errorStrategy 'retry'
    maxRetries 2
    label 'low_mem'

    publishDir path: "${params.outdir}/${patient_id}/${sample_id}/RNA/fastp/tumor/", mode: 'copy', pattern: '*_T*'

    input:
    tuple val(patient_id), val(sample_id),
        path(normal_reads, stageAs: 'fastp_normal_reads/*'),
        path(tumor_reads, stageAs: 'fastp_tumor_reads/*'),
        path(rna_reads, stageAs: 'rna_reads/*')

    output:
    tuple val(patient_id_tumor), val(sample_id), path("${patient_id_tumor}_trim_{1,2}.fq.gz"), emit: reads_tumor
    path("${patient_id_tumor}.fastp.json"), emit: json_tumor
    path("${patient_id_tumor}.fastp.html"), emit: html_tumor

    script:
    patient_id_tumor = patient_id + "_T"
    def (r1_tumor, r2_tumor) = rna_reads

    """
    /data1/software/miniconda/envs/MMRADAR/bin/fastp --in1 "${r1_tumor}" --in2 "${r2_tumor}" \
        -q 20 -u 20 -l 40 --detect_adapter_for_pe --out1 "${patient_id_tumor}_trim_1.fq.gz" \
        --out2 "${patient_id_tumor}_trim_2.fq.gz" --json "${patient_id_tumor}.fastp.json" \
        --html "${patient_id_tumor}.fastp.html" --thread 20
    """
}

workflow.onComplete { 
        log.info ( workflow.success ? "Done rna fastp!" : "Oops .. something went wrong in rna fastp"  )
}

Where do I put a check so that nothing is processed for rows whose RNA reads are NA?

mribeirodantas · January 18, 2024, 8:13pm

If I understood it correctly, you want the fastp_rna process to be skipped when it runs into an NA. The first thing to care about is how to handle the NA in the CSV file so that Nextflow doesn’t try to do file('NA'). Then, filter out the channel elements that contain the missing reads so that only proper input is directed at the RNA-related process. See the example below:

def createTupleOrString(fileString) {
  if (fileString == "NA") {
      return "NA"
  } else {
      return file(fileString)
  }
}

process RNA_process {
  debug true

  input:
  tuple val(sample_name), path(normal_reads), path(tumor_reads), path(rna_reads), val(time_point)

  output:
  stdout

  script:
  """
  echo ${rna_reads}
  """
}

workflow {
  Channel
    .fromPath(file("input_timestamp.csv"))
    .splitCsv(sep: ',')
    .map { row ->
      // Extract relevant information
      def sample_ID = row[0]
      def normal_Reads = tuple(file(row[1]), file(row[2]))
      def tumor_Reads = tuple(file(row[3]), file(row[4]))
      def rna_Reads = tuple(createTupleOrString(row[5]), createTupleOrString(row[6]))
      def time_Point = row[7]

      // Return a map with the processed information
      return [sample_name: sample_ID, normal: normal_Reads, tumor: tumor_Reads, rna: rna_Reads, time_point: time_Point]
    }
    .set { samples }

  samples
    .filter { it.rna[0] != 'NA' }
    | RNA_process
}

complexgenome · February 1, 2024, 4:53pm

Sorry for the late reply.
Yes, that is correct: fastp_rna needs to be skipped for rows where NA is encountered.

mribeirodantas · February 1, 2024, 10:41pm

Did my solution work for you?

complexgenome · February 7, 2024, 7:44pm

Hi @mribeirodantas Yes, it worked.

Is there a way to check multiple conditions in the filter using OR (||)?

it.rna[0] != 'NA'

Like it.rna[0] != 'NA' || it.rna[1] != 'NA'

mribeirodantas · February 7, 2024, 9:56pm

You can use || for “or” in Groovy, but filtering on multiple conditions can also be achieved without it by chaining filters, as in the example below:

...
 samples
    .filter { it.rna[0] != 'NA' }
    .filter { it.time_point != 'NA' }
    | RNA_process
...
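
One caveat worth spelling out: chained filters behave like a logical AND, not an OR. A row passes `a || b` as soon as one condition holds, so `it.rna[0] != 'NA' || it.rna[1] != 'NA'` still lets through a row with only one RNA read. The plain-Groovy sketch below illustrates the difference on a list, with `findAll` standing in for the channel `filter` operator; the rows are made-up examples, not data from this thread.

```groovy
// Hypothetical rows shaped like the maps built in the workflow above.
def rows = [
    [sample_name: 'A', rna: ['r1.fq.gz', 'r2.fq.gz'], time_point: 'T-01'],
    [sample_name: 'B', rna: ['r1.fq.gz', 'NA'],       time_point: 'T-02'],
    [sample_name: 'C', rna: ['NA', 'NA'],             time_point: 'T-03'],
]

// "||" keeps a row when at least one read is present, so B slips through.
def keptWithOr = rows.findAll { it.rna[0] != 'NA' || it.rna[1] != 'NA' }

// "&&" keeps a row only when both reads are present.
def keptWithAnd = rows.findAll { it.rna[0] != 'NA' && it.rna[1] != 'NA' }

// Chaining two filters gives the same result as "&&".
def keptChained = rows.findAll { it.rna[0] != 'NA' }.findAll { it.rna[1] != 'NA' }

assert keptWithOr*.sample_name == ['A', 'B']
assert keptWithAnd*.sample_name == ['A']
assert keptChained == keptWithAnd
```

So to require that both RNA reads are present, use `&&` in a single filter or chain one filter per read.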

complexgenome · February 8, 2024, 3:17pm

Nice. Thank you

complexgenome · February 8, 2024, 3:52pm

@mribeirodantas I think I need more help.

In the above example, skipping the RNA step when the RNA reads are NA works fine.

However, I’d still like to analyse the tumor and normal reads, for instance with whole-exome sequencing.

In the example above, for MM-0245 I’d like to process the tumor and normal reads. It’s fine not to do the RNA analysis. How do I enable the WES analysis for the two tuples (tumor and normal) in this case?

If I use filter, the whole row is dropped, so the tumor and normal reads are skipped as well.

mribeirodantas · February 8, 2024, 4:09pm

Maybe the best option here is to use the branch channel operator, and have as many channels as you want. Then, you can provide each channel to the desired subworkflow/process.

Check the snippet below:

Channel
    .of([type:'tumor',  rna:'NA'],
        [type:'tumor',  rna:file('/so/me/path')],
        [type:'normal', rna:file('/some/other/path')])
    .branch {
        tumor:  it.type == 'tumor' && it.rna != 'NA'
        normal: it.type == 'normal'
    }
    .set { result }

result.tumor.view { "$it is a tumor" }
// result.normal.view { "$it is not a tumor" }

You could do PROCESS_FOO(result.tumor) for example.

Does it seem to solve your problem?

complexgenome · February 8, 2024, 4:19pm

@mribeirodantas

Sorry, I do not follow your solution here.
How would I apply it to my case?

I have the CSV file/data shown above. How do I start from it and branch it out? I can’t work out how to use branch while reading a CSV.

complexgenome · February 8, 2024, 5:00pm

@mribeirodantas

I’m trying the code below, but I can’t get it to work:

Channel.fromPath(file("temp_timestamp_NA.csv"))  //Channel.fromPath(file("input_timestamp.csv"))
        .splitCsv(sep: ',').map{ row -> 
            // Extract relevant information
            def batch_info = row[0]
            def time_point=row[1]
            def normal_reads = tuple((row[2]),(row[3]))
            def tumor_reads = tuple((row[4]), (row[5]))
            def rna_reads = tuple(createTupleOrString(row[6]), createTupleOrString(row[7]))           
            
                         [
            [type: "tumor", data: tumor_reads,meta: [type: "tumor"]],
            [type: "normal", data: normal_reads, meta: [type: "normal"]],
            [type: "rna", data: rna_reads, meta: [type: "rna"]]
        ]
             
        }.branch { type, reads, meta ->
        tumor: meta.type == "tumor"
        normal: meta.type == "normal"
        rna: meta.type == "rna" && reads[0]!='NA'
    } | set {test}

    test.tumor.view { "$it is a tumor" }

Is this OK?

How do I keep the batch and timepoint information?

I am unable to view the contents.
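
For reference, here is a minimal sketch of one way to route the same CSV rows to both analyses while keeping the patient and timepoint. It does not use branch, because branch sends each item to only one output channel (the first matching condition), whereas here every row should reach the WES step and only some rows the RNA step. Instead, the same `samples` channel is consumed twice, with a filter on the RNA side only. Assumptions not taken from the thread: the DNA columns are always filled in, the channel names `wes_input` and `rna_input` are illustrative, and the receiving processes would declare matching tuple inputs.

```groovy
workflow {
    Channel
        .fromPath(params.input_csvFile)
        .splitCsv()
        .map { row ->
            // Keep the raw strings here so that "NA" can be tested before any file() call.
            [
                patient: row[0],
                sample : row[1],
                normal : [row[2], row[3]],
                tumor  : [row[4], row[5]],
                rna    : [row[6], row[7]]
            ]
        }
        .set { samples }

    // WES side: every row is kept (assumes normal and tumor reads are never NA).
    samples
        .map { s -> [s.patient, s.sample, s.normal.collect { p -> file(p) }, s.tumor.collect { p -> file(p) }] }
        .set { wes_input }

    // RNA side: only rows where both RNA reads are present.
    samples
        .filter { s -> s.rna.every { r -> r != 'NA' } }
        .map { s -> [s.patient, s.sample, s.rna.collect { p -> file(p) }] }
        .set { rna_input }

    // Inspect what each analysis would receive.
    wes_input.view { v -> "WES input: $v" }
    rna_input.view { v -> "RNA input: $v" }
}
```

With the example CSV, all three MM-0245 rows should appear in `wes_input`, while T-03 should be missing from `rna_input`. A process consuming `rna_input` would then declare something like `tuple val(patient_id), val(sample_id), path(rna_reads)`, so the patient and timepoint travel with the reads.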

system · February 15, 2024, 5:00pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
