Studio session restart from checkpoint fails

Hi,

I have been having issues lately when restarting a data studio from any checkpoint; it fails with the following issue:

Has anybody seen that?

Thanks

Hi @lpantano,

Welcome to the community and thanks for your question.

This would typically only occur if one or more previous checkpoint images for that specific session (studio-fb29) are missing from S3. Could you please check if the files exist (perhaps someone has accidentally deleted one or more of the checkpoint files)? Also, can you please check to see if there is a lifecycle policy in place that removes files from that S3 folder?
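
In case it helps, here is a minimal sketch of how you could check both of those things, assuming Python with boto3; the bucket name and prefix below are placeholders rather than your actual paths:

import boto3
from botocore.exceptions import ClientError

BUCKET = "my-workspace-bucket"      # placeholder: the bucket used by the workspace
PREFIX = "studios/studio-fb29/"     # placeholder: prefix of the session's checkpoint files

s3 = boto3.client("s3")

# 1. List the checkpoint objects and their sizes (missing or 0-byte files are the problem case).
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in resp.get("Contents", []):
    print(f"{obj['Key']}\t{obj['Size']} bytes\t{obj['LastModified']}")

# 2. Check whether a lifecycle configuration could be expiring objects in that folder.
try:
    rules = s3.get_bucket_lifecycle_configuration(Bucket=BUCKET)["Rules"]
    for rule in rules:
        print("Lifecycle rule:", rule.get("ID"), rule.get("Status"), rule.get("Filter"))
except ClientError as err:
    if err.response["Error"]["Code"] == "NoSuchLifecycleConfiguration":
        print("No lifecycle configuration on this bucket.")
    else:
        raise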

Thanks,
Rob N

Thank you for chiming in. There is no lifecycle policy affecting that folder. I went and checked the different data.img files shown in that error. Two of them (the first and the third) are there but have 0 size, and the middle one doesn’t exist. The .command.log shows an error. Sharing the screenshots and the log.

This example is an old Data Studio; I can try with a new one where we get the same error. Is the 5 GB limitation on the image size still in place?

Screenshot 2025-02-13 at 7.09.15 PM
Screenshot 2025-02-13 at 7.08.58 PM
command.log (3.6 KB)

Hi @lpantano,

Thanks for the additional information - very helpful. There is no longer the 5GB limitation for the Data Studio session’s active volume - that was fixed back in early December 2024.

The way it works now is that if the free storage drops below a threshold, the volume is resized automatically without erroring. However, if the boot disk size limit of the Compute Environment (~30 GB) is exceeded, problems can arise.
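
Purely to illustrate the behaviour described above (this is not the actual Platform implementation; the threshold and growth increment below are made-up placeholders), the resize decision amounts to something like this:

# Illustrative sketch only -- not the actual Platform code; numbers are placeholders.
FREE_SPACE_THRESHOLD_GB = 1     # hypothetical free-space threshold that triggers a resize
BOOT_DISK_LIMIT_GB = 30         # approximate boot disk size limit mentioned above

def maybe_resize(volume_size_gb: float, free_gb: float, grow_by_gb: float = 5.0) -> float:
    """Grow the volume when free space is low, but never past the boot disk limit."""
    if free_gb >= FREE_SPACE_THRESHOLD_GB:
        return volume_size_gb   # enough headroom, nothing to do
    new_size = volume_size_gb + grow_by_gb
    if new_size > BOOT_DISK_LIMIT_GB:
        # This is the situation where problems can arise.
        raise RuntimeError("boot disk size limit of the Compute Environment exceeded")
    return new_size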

From your attached command.log file:

{"caller":"connect-client/main.go:49","level":"warning","msg":"Unexpected application failure: execute error:  5479788544 bytes available at /tmp/oe3549808997, minimum of 10737418240 needed","time":"2024-09-23T17:58:47.367Z"}

It looks like a minimum of almost 11 GB is needed, and also that this session hasn’t been started since September 23rd, 2024, which predates the fix to the image size limitation.
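
For reference, here is the arithmetic behind those two byte counts from the log line (nothing Platform-specific, just unit conversion):

# Convert the two values from the log line above into GiB / GB.
available = 5_479_788_544       # bytes available at /tmp/oe3549808997
needed    = 10_737_418_240      # minimum bytes needed (exactly 10 GiB)

print(f"available: {available / 2**30:.2f} GiB ({available / 1e9:.2f} GB)")
print(f"needed:    {needed / 2**30:.2f} GiB ({needed / 1e9:.2f} GB)")
# -> available: 5.10 GiB (5.48 GB)
# -> needed:    10.00 GiB (10.74 GB)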

Let me discuss with our Engineering team and get back to you.

Warm regards,
Rob N

Hi @lpantano,

I spoke with our Engineering team, and there are several issues here:

  1. A missing checkpoint file means you can’t start the Studio session (it will fail when trying to load the missing checkpoint)
  2. Existing checkpoint files with a size of 0 B suggest there was an error when saving/writing the checkpoint to S3. We did a quick test, and just starting and stopping a Studio session results in a 4 KB checkpoint file. Could a permissions issue have caused this?

Unfortunately, this means that the state of this particular Studio session is lost and unrecoverable. What would help us avoid this happening in the future is if you could send the logs from the Studio session (the job logs of the run on AWS Batch, so we can see what the errors were when saving the checkpoint files) to my email.

Other ways to address disk-space-related limitations in Studios are:

  1. Specify a larger boot disk size when creating a Compute Environment (Advanced options/Boot disk size)
  2. Stop some of the running Studios/Pipelines that share/use the same Compute Environment

Hopefully once you share the logs with me, we can narrow down the problem.

Thanks for your patience,
Rob N

This is very helpful.

I am not sure what the permissions issue could be, since the keys are working well for pipelines.

The logs you are referring to - do I need to take them from the AWS console, or are they available in Seqera?

I will try point 1 from the second list. We don’t have many running Studios/Pipelines in that compute environment, but I will keep an eye on that.

Thanks! Appreciate the help!

Hi @lpantano,

Happy Friday! I’m glad I could be of assistance.

Regarding the logs - yes, you would need to retrieve those from AWS Batch: first find the Job queue of the specific compute environment (CE), and then the specific Studio session’s Job Name within that Batch Job queue.

The CE Id is available from the Compute Environment page of the Platform - see the screenshot below:

In AWS Batch, this will be your Job queue Name, except that in AWS Batch it is prefixed, i.e. TowerForge-<Id>.

You can then find the specific session’s Batch Job Name from the Studio session URL; it will show up in the selected AWS Batch Job queue as a unique Job Name.

  1. Your Studio session URL will be something like /orgs/<your-org-name>/workspaces/<your-workspace-name>/studios/<session-id>/details
  2. Copy the <session-id> and then match it with the AWS Batch Job Name data-studio-<session-id> in the list of all jobs in the queue
  3. Select (click) the specific Job Name link to go to the detail view
  4. To download the logs, go to the Logging tab at the foot of the page and click “Retrieve logs”. You should either already have Amazon CloudWatch enabled or you will need to enable it (note: you will be charged a small usage fee if it isn’t enabled already, so proceed with caution, and remember to turn it back off afterwards if you only enable it for this exercise, to avoid incurring unnecessary costs). If you prefer to script the retrieval instead, there is a rough sketch after this list.
  5. Send the logs to my email! :slight_smile:
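
In case it’s easier, here is a rough scripted alternative to steps 1-5 (assuming Python with boto3, the default /aws/batch/job CloudWatch log group, and placeholder values for the job queue name and session id):

import boto3

JOB_QUEUE = "TowerForge-<Id>"       # placeholder: the Batch job queue of your CE
SESSION_ID = "<session-id>"         # placeholder: taken from the Studio session URL

batch = boto3.client("batch")
logs = boto3.client("logs")

# 1. Find the Batch job named data-studio-<session-id> in the queue.
summaries = batch.list_jobs(
    jobQueue=JOB_QUEUE,
    filters=[{"name": "JOB_NAME", "values": [f"data-studio-{SESSION_ID}"]}],
)["jobSummaryList"]

for summary in summaries:
    # 2. Look up the CloudWatch log stream name from the job detail.
    detail = batch.describe_jobs(jobs=[summary["jobId"]])["jobs"][0]
    stream = detail["container"].get("logStreamName")
    if not stream:
        continue
    # 3. Print the first page of log events from the default AWS Batch log group.
    events = logs.get_log_events(
        logGroupName="/aws/batch/job", logStreamName=stream, startFromHead=True
    )["events"]
    for event in events:
        print(event["message"])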

Thanks so much for your patience, and hopefully we can get this resolved for you.

Note: if these instructions are too complex (or you don’t have time!), just let me know and we can schedule a call to go through it together.

Have a wonderful weekend,
Rob N

Hi @lpantano,

Do you still need help with this request, or are you all set?

Warm regards,
Rob N

Hi Rob,

Let me close this. I will keep this information and collect the logs this way if it happens again. I want to make sure there is still an issue first, because many of these images were from December. Thanks!
