Timeout issue in Nextflow process run on Azure Batch

Naveen_Rajan · January 8, 2024, 11:01am

Hi everyone, I am a DevOps engineer who is new to Nextflow, and I am looking to run a Nextflow pipeline on Azure Batch.
I am working with input FASTQ samples for testing the digestibility ratio of dogs.

I am getting a timeout error at the DADA2_ERR process, which runs after the DADA2_PREPROCESSING step. Even if I set a timeout or retry block, the same issue persists. Please let me know how I can solve this.

The screenshot shows the processes executed in my pipeline; please have a look at it as well.

  task: name=NFCORE_AMPLISEQ:AMPLISEQ:DADA2_ERR (1); work-dir=az://nextflow/work_01_01_2024/f1/931fd5f40eb3e8abc49bd953a0bdae
  error [java.lang.RuntimeException]: java.net.SocketTimeoutException: timeout
Jan-02 00:04:15.849 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_AMPLISEQ:AMPLISEQ:DADA2_ERR (1)'
Caused by:
  timeout
java.lang.RuntimeException: java.net.SocketTimeoutException: timeout
        at rx.exceptions.Exceptions.propagate(Exceptions.java:57)
        at rx.observables.BlockingObservable.blockForSingle(BlockingObservable.java:463)
        at rx.observables.BlockingObservable.single(BlockingObservable.java:340)
        at com.microsoft.azure.batch.protocol.implementation.TasksImpl.get(TasksImpl.java:1237)
        at com.microsoft.azure.batch.TaskOperations.getTask(TaskOperations.java:658)
        at com.microsoft.azure.batch.TaskOperations.getTask(TaskOperations.java:600)
        at nextflow.cloud.azure.batch.AzBatchService$_getTask_lambda7.doCall(AzBatchService.groovy:325)
        at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:236)
        at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
        at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:75)
        at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:176)
        at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:437)
        at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:115)
        at nextflow.cloud.azure.batch.AzBatchService.apply(AzBatchService.groovy:911)
        at nextflow.cloud.azure.batch.AzBatchService.getTask(AzBatchService.groovy:325)
        at nextflow.cloud.azure.batch.AzBatchTaskHandler.taskState0(AzBatchTaskHandler.groovy:160)
        at nextflow.cloud.azure.batch.AzBatchTaskHandler.checkIfCompleted(AzBatchTaskHandler.groovy:116)
        at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:615)
        at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:537)
        at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:412)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1254)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1030)
        at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:1036)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:1019)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:97)
        at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:293)
        at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
        at groovy.lang.Closure.run(Closure.java:498)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.SocketTimeoutException: timeout
        at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:672)
        at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:680)
        at okhttp3.internal.http2.Http2Stream.takeHeaders(Http2Stream.java:153)
        at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:131)
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.rest.interceptors.LoggingInterceptor.intercept(LoggingInterceptor.java:116)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.rest.retry.RetryHandler.intercept(RetryHandler.java:75)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.rest.interceptors.CustomHeadersInterceptor.intercept(CustomHeadersInterceptor.java:140)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.rest.interceptors.UserAgentInterceptor.intercept(UserAgentInterceptor.java:83)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.azure.batch.auth.BatchSharedKeyCredentialsInterceptor.intercept(BatchSharedKeyCredentialsInterceptor.java:54)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.rest.interceptors.BaseUrlHandler.intercept(BaseUrlHandler.java:43)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.rest.interceptors.RequestIdHeaderInterceptor.intercept(RequestIdHeaderInterceptor.java:29)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
        at okhttp3.RealCall.execute(RealCall.java:93)
        at retrofit2.OkHttpCall.execute(OkHttpCall.java:188)
        at retrofit2.adapter.rxjava.CallExecuteOnSubscribe.call(CallExecuteOnSubscribe.java:40)
        at retrofit2.adapter.rxjava.CallExecuteOnSubscribe.call(CallExecuteOnSubscribe.java:24)
        at rx.Observable.unsafeSubscribe(Observable.java:10327)
        at rx.internal.operators.OnSubscribeMap.call(OnSubscribeMap.java:48)
        at rx.internal.operators.OnSubscribeMap.call(OnSubscribeMap.java:33)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
        at rx.Observable.subscribe(Observable.java:10423)
        at rx.Observable.subscribe(Observable.java:10390)
        at rx.observables.BlockingObservable.blockForSingle(BlockingObservable.java:443)

mribeirodantas · January 8, 2024, 2:13pm

Have you been able to reproduce this error more than once? Does this happen consistently?

Naveen_Rajan · January 8, 2024, 2:32pm

Yes, this happens consistently. On one run it executed the DADA2_ERR process for 1 hour and continued to the next step, DADA2_DENOISING, but failed again there.

mribeirodantas · January 8, 2024, 3:55pm

I checked with the engineering team, and the failing API request is just checking the task status, which Nextflow does periodically. It’s not really clear why it would time out at a particular point in the pipeline.

The executor.pollInterval configuration option is 10s by default for Azure Batch. Could you please try something higher, like 30s or 60s? It’s a guess, but worth trying.
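
For reference, this setting goes in nextflow.config. A minimal sketch, assuming the pipeline is launched with a config file you control; the 30-second value is simply the first of the two values suggested here, not a tested recommendation:

```groovy
// nextflow.config
executor {
    // How often Nextflow polls the executor for task status.
    // Raising it reduces the number of status requests sent to Azure Batch.
    pollInterval = '30 sec'
}
```

If 30 seconds makes no difference, the same line can be changed to '60 sec' before moving on to other options.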

Naveen_Rajan · January 8, 2024, 4:40pm

OK, I will try setting pollInterval to 30 seconds and run it.

mribeirodantas · January 8, 2024, 6:01pm

In parallel, I would like to make you aware that a pull request has been created in the Nextflow GitHub repository to support retries for these timeouts in Azure Batch.

If the previously suggested solution doesn’t work, I can assist you in trying this new feature (as soon as it’s merged).
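
For context, the Azure plugin exposes retry settings for failed API requests under the azure.retryPolicy scope. The sketch below uses illustrative values that are assumptions, not tuned for this pipeline. Whether a socket timeout is actually retried depends on the Nextflow version in use, which is what the pull request is about:

```groovy
// nextflow.config
azure {
    retryPolicy {
        delay       = '500 ms'  // wait before the first retry (illustrative value)
        maxDelay    = '120 s'   // upper bound on the wait between retries (illustrative value)
        maxAttempts = 10        // give up after this many attempts
        jitter      = 0.25      // random variation applied to each delay
    }
}
```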

Naveen_Rajan · January 8, 2024, 6:25pm

Thank you so much, I will let you know ASAP.

Naveen_Rajan · January 8, 2024, 6:33pm

I am stuck on the same error again. What is the next step we can try?

mribeirodantas · January 9, 2024, 4:56am

One option is to clone the Nextflow GitHub repository, merge this branch, and compile Nextflow (make compile). This way it will support the retry policy for Azure, and you can run your pipeline again to see if it works. Be careful to run the newly compiled Nextflow and not the one already installed on your system (use ./launch.sh run instead of nextflow run).

Naveen_Rajan · January 9, 2024, 11:07am

Hi @mribeirodantas. Previously I tried to execute the Nextflow pipeline with around 20 samples on Azure Batch, launching the nextflow command from a local Ubuntu terminal, for which I had to connect to a VPN and restart NTP every time.

To overcome this, I tried executing the pipeline with only 3 samples and used an AWS EC2 instance as the terminal for running the nextflow command, and it worked.
Current VM type for the Azure Batch pool: Standard_D64s_v3.

However, one or two processes still take around 15 to 30 minutes to complete. If you could give me a suggestion on choosing the correct compute type based on the input samples, that would be great.

mribeirodantas · January 9, 2024, 12:34pm

Are you running this -with-report? The plots at the bottom of the report are very useful to understand if it’s taking long because it just takes long or if it’s because resources were not adequately provided.
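
If it is easier than remembering the command-line flags, the same outputs can be switched on in nextflow.config. A minimal sketch; the file names are arbitrary placeholders chosen for this example:

```groovy
// nextflow.config
report {
    enabled = true
    file    = 'execution_report.html'   // HTML report with the resource usage plots
}
timeline {
    enabled = true
    file    = 'execution_timeline.html' // per-task timeline
}
trace {
    enabled = true
    file    = 'execution_trace.txt'     // tab-separated table of per-task metrics
}
```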

Naveen_Rajan · January 9, 2024, 12:47pm

So far I have not tried that; I will from now on.

Naveen_Rajan · January 10, 2024, 7:04am

I have already used the -with-trace option in my command and am able to get the report and timeline files in HTML format. Since I am not familiar with the processes being executed, I am not able to figure out the compute needed.

mribeirodantas · January 11, 2024, 1:59pm

I’m not sure if I understand. Are you having difficulty interpreting the plots to refine the resource requirements of your processes?

Naveen_Rajan · January 11, 2024, 2:33pm

Yes, exactly, you got it right.

mribeirodantas · January 11, 2024, 11:44pm

Check the plot below, for example.

Here, we can see that some processes (look at all the first boxplots on the left, for example) never used even 50% of the CPUs that they requested. In this case, we could be requesting 1 CPU, and they’re extremely simple processes, fine, but in some circumstances, we may be requesting 4 CPUs and the process (all its tasks) is not using even 10% of what we requested. We can do the same exercise with RAM memory. See the picture below:

No process used even 50% of the memory requested in the configuration. So ideally, pipeline developers should take these plots into consideration and refine their requests (this does not apply only to cloud computing; proper resource requests for clusters are also important). But doing this manually is not very nice.
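
As a rough illustration of what a manual refinement looks like, resource requests can be overridden per process in nextflow.config. The numbers below are hypothetical placeholders, not recommendations for ampliseq; the right values are whatever the plots show the tasks actually using, plus some headroom:

```groovy
// nextflow.config
process {
    // Override the request for a single process, selected by name.
    withName: 'DADA2_ERR' {
        cpus   = 4        // hypothetical: lower this if the CPU plot shows low usage
        memory = '16 GB'  // hypothetical: peak memory from the report plus a margin
        time   = '2 h'    // hypothetical wall-time limit
    }
}
```

The same pattern can be repeated with one withName block per process that the report shows as over- or under-provisioned.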

In Seqera Cloud, there’s a button that automates this for you. See the screenshot below:

Naveen_Rajan · January 16, 2024, 6:33am

Thanks, that was pretty clear to me.

Naveen_Rajan · January 17, 2024, 8:12am

Thanks Dantas, this helped me reduce the pipeline duration.

system · January 24, 2024, 8:13am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.
