Timeout issue in Nextflow process run on Azure Batch

Hi everyone, I am a DevOps engineer who is new to Nextflow, and I am looking to run a Nextflow pipeline on Azure Batch.
I am working with input FASTQ samples to test the digestibility ratio in dogs.

I am getting a timeout error at the DADA2_ERR process, which runs after the DADA2_PREPROCESSING step. Even if I set a timeout or a retry block, the same issue persists. Please let me know how I can solve this.


The screenshot shows the processes executed in my pipeline; please have a look at it as well.


  task: name=NFCORE_AMPLISEQ:AMPLISEQ:DADA2_ERR (1); work-dir=az://nextflow/work_01_01_2024/f1/931fd5f40eb3e8abc49bd953a0bdae
  error [java.lang.RuntimeException]: java.net.SocketTimeoutException: timeout
Jan-02 00:04:15.849 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'NFCORE_AMPLISEQ:AMPLISEQ:DADA2_ERR (1)'
Caused by:
  timeout
java.lang.RuntimeException: java.net.SocketTimeoutException: timeout
        at rx.exceptions.Exceptions.propagate(Exceptions.java:57)
        at rx.observables.BlockingObservable.blockForSingle(BlockingObservable.java:463)
        at rx.observables.BlockingObservable.single(BlockingObservable.java:340)
        at com.microsoft.azure.batch.protocol.implementation.TasksImpl.get(TasksImpl.java:1237)
        at com.microsoft.azure.batch.TaskOperations.getTask(TaskOperations.java:658)
        at com.microsoft.azure.batch.TaskOperations.getTask(TaskOperations.java:600)
        at nextflow.cloud.azure.batch.AzBatchService$_getTask_lambda7.doCall(AzBatchService.groovy:325)
        at dev.failsafe.Functions.lambda$toCtxSupplier$11(Functions.java:236)
        at dev.failsafe.Functions.lambda$get$0(Functions.java:46)
        at dev.failsafe.internal.RetryPolicyExecutor.lambda$apply$0(RetryPolicyExecutor.java:75)
        at dev.failsafe.SyncExecutionImpl.executeSync(SyncExecutionImpl.java:176)
        at dev.failsafe.FailsafeExecutor.call(FailsafeExecutor.java:437)
        at dev.failsafe.FailsafeExecutor.get(FailsafeExecutor.java:115)
        at nextflow.cloud.azure.batch.AzBatchService.apply(AzBatchService.groovy:911)
        at nextflow.cloud.azure.batch.AzBatchService.getTask(AzBatchService.groovy:325)
        at nextflow.cloud.azure.batch.AzBatchTaskHandler.taskState0(AzBatchTaskHandler.groovy:160)
        at nextflow.cloud.azure.batch.AzBatchTaskHandler.checkIfCompleted(AzBatchTaskHandler.groovy:116)
        at nextflow.processor.TaskPollingMonitor.checkTaskStatus(TaskPollingMonitor.groovy:615)
        at nextflow.processor.TaskPollingMonitor.checkAllTasks(TaskPollingMonitor.groovy:537)
        at nextflow.processor.TaskPollingMonitor.pollLoop(TaskPollingMonitor.groovy:412)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
        at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1254)
        at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1030)
        at org.codehaus.groovy.runtime.InvokerHelper.invokePogoMethod(InvokerHelper.java:1036)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethod(InvokerHelper.java:1019)
        at org.codehaus.groovy.runtime.InvokerHelper.invokeMethodSafe(InvokerHelper.java:97)
        at nextflow.processor.TaskPollingMonitor$_start_closure2.doCall(TaskPollingMonitor.groovy:293)
        at nextflow.processor.TaskPollingMonitor$_start_closure2.call(TaskPollingMonitor.groovy)
        at groovy.lang.Closure.run(Closure.java:498)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.SocketTimeoutException: timeout
        at okhttp3.internal.http2.Http2Stream$StreamTimeout.newTimeoutException(Http2Stream.java:672)
        at okhttp3.internal.http2.Http2Stream$StreamTimeout.exitAndThrowIfTimedOut(Http2Stream.java:680)
        at okhttp3.internal.http2.Http2Stream.takeHeaders(Http2Stream.java:153)
        at okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:131)
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.rest.interceptors.LoggingInterceptor.intercept(LoggingInterceptor.java:116)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:127)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.rest.retry.RetryHandler.intercept(RetryHandler.java:75)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.rest.interceptors.CustomHeadersInterceptor.intercept(CustomHeadersInterceptor.java:140)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.rest.interceptors.UserAgentInterceptor.intercept(UserAgentInterceptor.java:83)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.azure.batch.auth.BatchSharedKeyCredentialsInterceptor.intercept(BatchSharedKeyCredentialsInterceptor.java:54)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.rest.interceptors.BaseUrlHandler.intercept(BaseUrlHandler.java:43)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at com.microsoft.rest.interceptors.RequestIdHeaderInterceptor.intercept(RequestIdHeaderInterceptor.java:29)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:257)
        at okhttp3.RealCall.execute(RealCall.java:93)
        at retrofit2.OkHttpCall.execute(OkHttpCall.java:188)
        at retrofit2.adapter.rxjava.CallExecuteOnSubscribe.call(CallExecuteOnSubscribe.java:40)
        at retrofit2.adapter.rxjava.CallExecuteOnSubscribe.call(CallExecuteOnSubscribe.java:24)
        at rx.Observable.unsafeSubscribe(Observable.java:10327)
        at rx.internal.operators.OnSubscribeMap.call(OnSubscribeMap.java:48)
        at rx.internal.operators.OnSubscribeMap.call(OnSubscribeMap.java:33)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
        at rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
        at rx.Observable.subscribe(Observable.java:10423)
        at rx.Observable.subscribe(Observable.java:10390)
        at rx.observables.BlockingObservable.blockForSingle(BlockingObservable.java:443)

Have you been able to reproduce this error more than once? Does this happen consistently?

Yes, this happens consistently. On one occasion it executed the DADA2_ERR process for 1 hour and continued to the next step, DADA2_DENOISING, but failed again there.

I checked with the engineering team and the API request is just checking the task status which Nextflow does periodically. It’s not really clear why it would timeout at a particular point in the pipeline.

The executor.pollInterval configuration option is 10s by default for Azure Batch. Could you please try something higher, like 30s or 60s? It’s a guess, but worth trying.
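
For reference, a minimal sketch of that change in a custom config file. The file name azure_poll.config is only an assumption; any file passed with -c works.

```groovy
// azure_poll.config (hypothetical file name)
// Ask the task monitor to check task status less often than the 10s default.
executor {
    pollInterval = '30 sec'   // try '60 sec' if the timeouts continue
}
```

It would be added to the run with -c azure_poll.config. This only lowers how often the status request is sent, so it may or may not remove the timeout.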

OK, I will try setting pollInterval to 30 seconds and run it again.

In parallel, I would like to make you aware that a pull request has been created in the Nextflow GitHub repository to support retries for these timeouts in Azure Batch :wink:.

If the previously suggested solution doesn’t work, I can assist you in trying this new feature (as soon as it’s merged) :handshake:

Thank you so much, I will let you know ASAP :innocent:

I am stuck on the same error again. What is the next step we can try? :face_with_monocle:

One option is to clone the Nextflow GitHub repository, merge this branch, and compile Nextflow (make compile). That way it’ll support the retry policy for Azure, and you can run your pipeline again to see if it works. Be careful to run the newly compiled Nextflow and not the one already installed on your system (use ./launch.sh run instead of nextflow run).

Hi @mribeirodantas, previously I tried to execute the Nextflow pipeline with around 20 samples on Azure Batch. My work terminal for running the nextflow command was a local Ubuntu terminal, for which I had to connect to a VPN and restart NTP every time.

To overcome this, I executed the pipeline with only 3 samples and used an AWS EC2 instance as my work terminal for running the nextflow command, and it worked.
Current VM type for the Azure Batch pool: Standard_D64s_v3.

However, one or two processes still take around 15 to 30 minutes to complete. If you could give me a suggestion on choosing the correct compute type based on the input samples, that would be great. :wink:

Are you running this -with-report? The plots at the bottom of the report are very useful to understand if it’s taking long because it just takes long or if it’s because resources were not adequately provided.
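
If it is easier, the report can also be switched on from a config file rather than on the command line. A minimal sketch, where the output file names are assumptions:

```groovy
// Enable the execution report, timeline and trace for every run.
report {
    enabled   = true
    file      = 'execution_report.html'    // assumed file name
    overwrite = true
}
timeline {
    enabled   = true
    file      = 'execution_timeline.html'  // assumed file name
    overwrite = true
}
trace {
    enabled   = true
    file      = 'execution_trace.txt'      // assumed file name
    overwrite = true
}
```

The resource usage plots are in the HTML report; the trace file holds the same per-task numbers as a table.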

So far I have not tried that; I will from now on :+1:

I have already used the -with-trace option in my command and I am able to get the report and timeline files in HTML format. Since I am not familiar with the processes being executed, I am not able to figure out the compute needed.

I’m not sure if I understand. Are you having difficulty interpreting the plots to refine the resource requirements of your processes?

Yes, exactly, you got it right.

Check the plot below, for example.

Here, we can see that some processes (look at the first boxplots on the left, for example) never used even 50% of the CPUs they requested. In this case we may be requesting only 1 CPU for extremely simple processes, which is fine, but in other circumstances we may be requesting 4 CPUs while the process (all of its tasks) is not using even 10% of what we requested. We can do the same exercise with memory. See the picture below:

No process used even 50% of the memory requested in the configuration. So ideally, pipeline developers should take these plots into consideration and refine their requests (this does not apply only to cloud computing, proper resource requests for clusters are also important). But doing this manually is not very nice :laughing:
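
Done by hand, the refinement is a per-process override in the config. A minimal sketch, assuming the process name from the error earlier in this thread and purely illustrative numbers (they are placeholders, not measured values; take yours from the plots):

```groovy
process {
    // Selector uses the simple process name; the values below are placeholders.
    withName: 'DADA2_ERR' {
        cpus   = 4
        memory = '16 GB'
    }
}
```

Each process that over-requests would get its own withName block, and the report from the next run shows whether the new values fit.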

In Seqera Cloud, there’s a button that automates this for you. See the screenshot below:

Thanks, that was pretty clear to me :wink:

Thanks, Dantas :wink: This helped me reduce the pipeline duration :+1:
