|
having spent time attempting to run jobs on shared GPUs that don't have virtual memory, i love returning to the land of overcommit

Cybernetic Vermin posted:
> but even there i think overcommit is good practice, it is a really convenient feature and a great optimization, and doing anything more than trivial attempts to recover from oom becomes incredibly messy very quickly. not least, software would almost necessarily allocate defensively (e.g. allocate all the memory you could need at the start of a transaction rather than risk having to deal with an oom partway through)

tensorflow does this: it always reserves all available GPU memory at process boot. this means you're SOL if you're trying to run a job on a shared node. wanna run some intensive physical simulations on this cluster? too bad, somebody is doing Machine Learning and that is more important than whatever you're trying to do

of course, if you try to reserve more memory than is available on a GPU, your CUDA program will immediately segfault. that's also annoying

overcommit is fine
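(for what it's worth, tensorflow 2.x does expose knobs to turn the reserve-everything behavior off — a minimal sketch of the two usual options, assuming the `tf.config` API; has to run before any GPU op initializes the devices, and the 4096 MiB cap is just an illustrative number:)

```python
import tensorflow as tf

# option 1: "memory growth" — the allocator starts small and grows on demand
# instead of grabbing (nearly) the whole card at process boot, which makes
# sharing a node with other jobs workable
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# option 2: a hard per-process cap (here 4 GiB on the first GPU), leaving
# the rest of the card's memory for everyone else on the node
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])
```

(neither helps if the ML job genuinely needs the whole card, of course — it just stops the framework from reserving memory it isn't using)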
|
# ¿ Aug 3, 2020 09:57 |