Manage datasets that are not stored in working directory

I’m a beginner in Nextflow, and I have the following problem: Nextflow seems to be built for processing data stored alongside the code being run, and as a consequence path objects that are absolute are actually turned into relative (staged) paths when Nextflow runs.
However, on the HPC cluster I’m using, the rule is to store datasets in a shared dedicated directory and to run code from a personal directory elsewhere. As a consequence, I can’t use paths relative to the Nextflow working directory, unless I do something obviously not ideal like ../../../../../../dataset_name.
Could you please explain to me what is the recommended, clean way to access data in my case?
For additional context, I have 2 datasets with the same internal structure that need to be processed by the same code, and I need to access several folders within each dataset by name. For example, I need to access dataset1/wavs, dataset2/segments/*_file.txt and so on. In bash, I would just define the dataset directory as a variable and append the subdirectories to it. What is the Nextflow equivalent of that?
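To illustrate, the bash pattern I have in mind is simply this (the paths are placeholders, not my real layout):

```shell
#!/usr/bin/env bash
# Define the dataset root once, then derive every subdirectory from it.
DATASET_DIR=/shared/datasets/dataset1   # placeholder path
SEGMENTS_DIR="$DATASET_DIR/segments"
WAV_DIR="$DATASET_DIR/wavs"

echo "$SEGMENTS_DIR"
echo "$WAV_DIR"
```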

Here is a minimal example of what I’m trying to achieve (not functional):

params.dataset_dir = '/very/long/absolute/path/to/dataset'

process mfccs {
	publishDir "results/mfccs", mode: 'copy'
	input:
	tuple path(dataset_dir), file(segments)
	output:
	file "${segments.baseName}.mat"
	script:
	"""
	# problem: the absolute path is lost here, because dataset_dir gets staged
	my_python_script.py \\
	--segment-file "${dataset_dir}/segments_files/${segments}" \\
	--wav-dir "${dataset_dir}/wavs" \\
	--output-file "${segments.baseName}.mat"
	"""
}

workflow {
	segments_files = Channel.fromPath('*_segments.txt')
	segments_ch = segments_files.map { file -> tuple(params.dataset_dir, file) }.view()
	segments_ch | mfccs
}

Thanks in advance for your help!

Once you create the channel with

segments_files = Channel.fromPath("${params.dataset_dir}/*_segments.txt")

you don't need to reference ${params.dataset_dir} for those files again: Nextflow handles the file staging for you (symlinks by default, which keeps storage usage down).
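For example, something like this — just a sketch, assuming DSL2; the directory names and the assumption that my_python_script.py is on the PATH come from your post, so adjust to your real layout:

```groovy
params.dataset_dir = '/very/long/absolute/path/to/dataset'

process mfccs {
	publishDir "results/mfccs", mode: 'copy'

	input:
	path segments   // each segment file, staged into the task dir by Nextflow
	path wav_dir    // the wavs directory, staged as a symlink

	output:
	path "${segments.baseName}.mat"

	script:
	"""
	my_python_script.py \\
	--segment-file ${segments} \\
	--wav-dir ${wav_dir} \\
	--output-file ${segments.baseName}.mat
	"""
}

workflow {
	segments_files = Channel.fromPath("${params.dataset_dir}/segments/*_segments.txt")
	// a bare file() becomes a value channel, reused for every segment file
	mfccs(segments_files, file("${params.dataset_dir}/wavs"))
}
```

Because both inputs are declared as path, the script only ever sees local staged names and you never need to rebuild absolute paths inside the process.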

I would just ingest both locations as params:

params.wavs = '/very/long/absolute/path/to/dataset1/wavs'
params.dataset_dir = '/very/long/absolute/path/to/dataset2/segments'

workflow {
	segments_files = Channel.fromPath("${params.dataset_dir}/*_segments.txt")
	wavs = Channel.fromPath("${params.wavs}/*.wav")
}
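And since you have two datasets with the same internal structure, one option is to drive everything from a channel of dataset roots — again a sketch, assuming DSL2, with placeholder paths. Each emitted tuple could feed a process whose input is declared as tuple path(wav_dir), path(segments):

```groovy
workflow {
	// one entry per dataset root (placeholder paths)
	dataset_dirs = Channel.of('/shared/datasets/dataset1', '/shared/datasets/dataset2')

	// pair each dataset's wavs directory with each of its segment files;
	// file() with a glob returns a list of matching paths
	segments_ch = dataset_dirs.flatMap { dir ->
		file("${dir}/segments/*_segments.txt")
			.collect { seg -> tuple(file("${dir}/wavs"), seg) }
	}

	segments_ch.view()
}
```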

Be careful with single quotes versus double quotes: single quotes do not interpolate variables, so the $ is taken literally. The following will not work:

segments_files = Channel.fromPath('${params.dataset_dir}/*_segments.txt')

That helped me solve my problem, thank you!
