I’m a beginner in Nextflow. I have the following problem: Nextflow seems to be built for processing data that live in the same directory as the code being run, and as a consequence, path objects that are absolute are actually turned into relative paths (staged into the work directory) when Nextflow runs.
However, on the HPC cluster I’m using, the rule is to store datasets in a shared dedicated directory and to run code from a personal directory elsewhere. As a consequence, I can’t use paths relative to the Nextflow working directory, unless I do something obviously not ideal like ../../../../../../dataset_name.
Could you please explain to me what is the recommended, clean way to access data in my case?
For additional context, I have two datasets with the same internal structure that need to be processed by the same code, and I need to access several folders within each dataset by name: for example, dataset1/wavs, dataset2/segments/*_file.txt, and so on. In bash, I would just define the dataset directory as a variable and append the subdirectories to it. What is the Nextflow equivalent of that?
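Concretely, this is the bash pattern I mean (the paths and file names here are made up for illustration; I use a temporary directory so the snippet is self-contained):

```shell
#!/usr/bin/env bash
# Define the dataset root once, then build every path by appending
# subdirectory names to it. mktemp stands in for the shared dataset dir.
dataset_dir=$(mktemp -d)

# Fake the internal structure of one dataset.
mkdir -p "${dataset_dir}/segments"
touch "${dataset_dir}/segments/a_file.txt"

# Access subfolders by appending their names to the root variable.
for f in "${dataset_dir}"/segments/*_file.txt; do
    echo "found ${f}"
done
```

This is the behavior I would like to reproduce cleanly in Nextflow.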
Here is a minimal example of what I’m trying to achieve (not functional):
params.dataset_dir = '/very/long/absolute/path/to/dataset'
process mfccs {
    publishDir "results/mfccs", mode: 'copy'

    input:
    tuple path(dataset_dir), path(segments)

    output:
    path "${segments.baseName}.mat"

    script:
    """
    my_python_script.py \\
        --segment-file ${dataset_dir}/segments_files/${segments} \\ # absolute path is lost
        --wav-dir "${dataset_dir}/wavs" \\
        --output-file "${segments.baseName}.mat"
    """
}
workflow {
    segments_files = Channel.fromPath('*_segments.txt')
    segments_ch = segments_files.map { f -> tuple(params.dataset_dir, f) }.view()
    segments_ch | mfccs
}
Thanks in advance for your help!