Running on Requin fails

Issue #205 closed
Erik Schnetter created an issue

Running on Requin on Sharcnet fails. The simulation is apparently created and submitted without problems, but when the run command executes, the simulation is inactive (!). I assume that either it was not activated during submission (which it should have been), or it was deactivated by a cleanup command in the meantime (which should not have happened, and which left no log trail).

I attach the screen output from remotely creating and submitting the simulation, and also the simfactory log file from Requin.

Comments (7)

  1. Erik Schnetter reporter
    • removed comment

    Running on Requin still fails. The screen output does not show anything suspicious. I attach the restart log file.

  2. Barry Wardell
    • removed comment

    This problem was preventing my jobs from running on Datura too. I tracked it down to CleanupRestarts() which gets called in main() in sim.py. (Why does this need to be called every time SimFactory is run?)

    When the queueing system starts the job, it calls the submit script, which executes 'sim run ...'. CleanupRestarts() checks whether active_id != -1, which is true because active_id has already been set to 0. It then calls finish() for that restart, so the restart is no longer active by the time the runscript is run.

    I don't really understand the code well enough to suggest a solution, although removing the CleanupRestarts() call from sim.py is sufficient to get my job to run.
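
The sequence described in the comment above can be illustrated with a toy model. This is only a sketch, not SimFactory's code: the class `Simulation`, its methods, the `cleanup_first` flag, and the assumption that `active_id == -1` means "no active restart" are all hypothetical. Only the names CleanupRestarts, finish, main, and active_id come from the thread, and their real bodies may differ.

```python
# Toy model of the failure mode described in the comment above.
# Hypothetical throughout: it mimics the reported behaviour and is
# not taken from sim.py.


class Simulation:
    def __init__(self):
        # Assumption: -1 means "no active restart".
        self.active_id = -1

    def submit(self):
        # Submission activates restart 0.
        self.active_id = 0

    def finish(self):
        # Finishing a restart deactivates it.
        self.active_id = -1

    def is_active(self):
        return self.active_id != -1


def cleanup_restarts(sim):
    # Stand-in for CleanupRestarts(): any restart that looks active
    # is finished, including the one that is about to run.
    if sim.active_id != -1:
        sim.finish()


def run(sim):
    # Stand-in for the 'sim run' step started by the queueing system.
    if not sim.is_active():
        raise RuntimeError("simulation is inactive")
    return "running restart %d" % sim.active_id


def main(sim, cleanup_first):
    # Stand-in for main(): optionally clean up before every command.
    if cleanup_first:
        cleanup_restarts(sim)
    return run(sim)
```

In this model, calling `main(sim, cleanup_first=True)` on a freshly submitted simulation raises the "inactive" error, while `cleanup_first=False` lets the run proceed. That matches the observation that removing the cleanup call is enough to get the job to run, although it says nothing about why the cleanup was there in the first place.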

  3. Barry Wardell
    • removed comment

    I've attached a patch which comments out the CleanupRestarts() call and also adds or fixes a couple of things which I noticed while investigating this.

  4. Ian Hinder
    • removed comment

    Erik, please could you retest this? If the problem has been fixed we can close the ticket, and if there is a workaround we can lower the priority from "critical".
