Exchanging cores and running time

Our default allocation on the Deigo "compute" partition is 4 days and 2000 cores. On "short" we have 4000 cores and 2 hours. Sometimes, however, you may need more time. You can request to exchange cores for time, and back again.

Split your computation

If you need more time we strongly recommend that you try to subdivide your job in some way so that you can stay within the default time limits.

  • You may be able to divide your data set and run each part separately.
     
  • Long-running software often has a way to save its state midway through. You can stop the job, then start a new job that continues where the previous job left off.
     
  • Software with no means to checkpoint its calculations or subdivide tasks into shorter jobs is usually not intended for very long computations. You can take a look and see if there is some other software that is better adapted for the large data volumes and long computation time that you need.
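
If your own code is the long-running part, the stop-and-continue pattern described above can be sketched roughly as follows. This is a minimal illustration, not a recommended implementation: the function names (load_state, save_state, run), the JSON state file and the squaring "work" step are all hypothetical stand-ins, and the max_steps argument merely imitates a job that stops before its time limit.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_index": 0, "total": 0}


def save_state(path, state):
    """Write to a temporary file, then rename, so a crash cannot leave a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(items, path, max_steps=None):
    """Process items starting from the last checkpoint; stop early after max_steps items."""
    state = load_state(path)
    steps = 0
    while state["next_index"] < len(items):
        if max_steps is not None and steps >= max_steps:
            break  # imitate a job that ends before all the work is done
        # Hypothetical stand-in for the real computation on one item.
        state["total"] += items[state["next_index"]] ** 2
        state["next_index"] += 1
        steps += 1
        save_state(path, state)
    return state
```

Calling run() a second time with the same state file picks up at the first unprocessed item, so each job only needs to fit a part of the work within the time limit. Saving after every item is the simplest choice; real code would usually save less often to limit file-system load.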

Computers, programs and humans are all fallible. The more time your job needs, the greater the risk that it will fail. Your application can crash; node hardware could break; the network might go down; a typhoon might shut down the entire cluster. If that happens while your job is still running, you may lose many days or weeks of time. If you can divide your computation into shorter jobs, you will lose much less time when something goes wrong.

But sometimes this is not possible. Your computation may depend on all the data at once, and the application may not support checkpointing or resuming partway through. At that point, the only remaining route is to extend the running time.
 

Exchanging time and cores

We allow you to exchange cores for time on an as-needed basis.

This change is per user, not per job. It will affect all your future jobs on the partition. You can of course apply to restore the original limits at any time. However, you can never have a mix of jobs with different limits submitted at the same time.

If you ask for more time, we give no guarantees that your job will be allowed to finish. As we wrote above, we may have a power failure, the node may crash, or we may shut down the system for maintenance.

Jobs with running times exceeding 7 days will not be considered when we shut down for maintenance.

Also, the total amount of memory you can use on the partition decreases as you reduce the number of cores.

Here are the cores and running time you can ask for:
 

Short Partition

  Time (hours)   Cores   Total Memory
  2              4000    6500G
  4              1500    3000G
  12             256     500G
  24             80      500G

Compute Partition

  Time (days)    Cores   Total Memory
  4              2000    7500G
  7              1000    4000G
  14             500     2000G
  20             250     1000G
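
To see what a given running time costs in cores and memory, the tables above can be written out as a small lookup. This is only an illustrative sketch: the LIMITS dictionary and the smallest_tier function are hypothetical names, not part of any cluster tool, and the numbers are copied from the tables above.

```python
# (time limit, cores, total memory in GB), copied from the tables above.
# Time is in hours for "short" and in days for "compute".
LIMITS = {
    "short": [(2, 4000, 6500), (4, 1500, 3000), (12, 256, 500), (24, 80, 500)],
    "compute": [(4, 2000, 7500), (7, 1000, 4000), (14, 500, 2000), (20, 250, 1000)],
}


def smallest_tier(partition, time_needed):
    """Return the shortest tier that covers time_needed, or None if no tier does."""
    for limit, cores, memory_gb in LIMITS[partition]:
        if time_needed <= limit:
            return {"time": limit, "cores": cores, "memory_gb": memory_gb}
    return None


# A 10-day job on "compute" needs the 14-day tier: 500 cores and 2000G in total.
print(smallest_tier("compute", 10))
```

Picking the shortest tier that fits keeps as many cores and as much memory available as possible.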

 

We do not grant more than 20 days of total running time; jobs that long are too unlikely to ever finish. If you need more time than that, we strongly suggest that you:

  1. look for a way to snapshot and continue your calculation;
  2. try to find an alternative application that supports snapshots;
  3. reconsider your approach to solving this problem and try to find a faster or more efficient way to achieve your results.

How to Apply

Send an email to "ask-scda@oist.jp" and tell us the partition and the running time you would like to have (in hours for "short", in days for "compute"). We will make the change once you have no jobs running on the system.