[Condor-users] Parallel universe/MPI issues when upgrading 7.2->7.4


Since upgrading from 7.2.5 to 7.4.3, our users have hit a problem with MPI jobs running under the parallel universe. We have found a workaround (described below), but it would be great if we could identify a proper fix.

The issue is that jobs using the "usual" MPI wrapper script (e.g. mp1script) now fail with the following:

In stdout:

error 0 chirp putting identity keys back

In stderr:

Can't chirp_client_open /home/condor/spool/cluster55247.proc0.subproc0/0.key:-1

The ShadowLog shows what looks like a new permissions problem:

09/13 10:48:29 (55247.0) (30445): Request to run on slot1@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx <> was ACCEPTED
09/13 10:48:29 (55247.0) (30445): FileTransfer::Init(): mkdir(/home/condor/spool/cluster55247.proc0.subproc0) failed, Permission denied (errno: 13)
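
Since the mkdir() fails with errno 13, one quick check is the ownership and mode of the spool directory itself. The following is only a minimal diagnostic sketch: describe_dir is a hypothetical helper name, and the result is only meaningful when run as the same account the shadow runs under, which is an assumption here.

```python
import os
import pwd
import stat


def describe_dir(path):
    """Report who owns `path`, its mode, and whether the current
    user could create a subdirectory inside it."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        # No passwd entry for this uid; fall back to the number.
        owner = str(st.st_uid)
    return {
        "owner": owner,
        "mode": stat.filemode(st.st_mode),
        # Creating an entry in a directory needs write + search permission.
        "can_create": os.access(path, os.W_OK | os.X_OK),
    }


# Example (path taken from the ShadowLog lines above):
# print(describe_dir("/home/condor/spool"))
```

If can_create comes back False for the account in question, that would be consistent with the "Permission denied" in the log, though it does not explain why 7.4.3 behaves differently from 7.2.5.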

We have found that we can get around the issue by spooling the data on submission with "condor_submit -spool", retrieving the output on completion with condor_transfer_data, and finally removing the job from the queue manually with condor_rm. This new behaviour is perplexing, as no configuration changes were made to the hosts during the upgrade.
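
For reference, the workaround amounts to the command sequence sketched below. This is only an illustration: the function names and the submit file name "job.sub" are hypothetical, and the cluster id is taken from the log excerpt above. The commands are built as argument lists so they can be inspected before anything is run.

```python
import subprocess


def submit_command(submit_file):
    """Command that submits the job with its input files spooled."""
    return ["condor_submit", "-spool", submit_file]


def retrieve_commands(cluster_id):
    """Commands to run once the job has completed: fetch the
    spooled output, then remove the job from the queue."""
    cid = str(cluster_id)
    return [
        ["condor_transfer_data", cid],
        ["condor_rm", cid],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Example (not executed here):
# run_all([submit_command("job.sub")])
# ... wait for the job to reach the completed state ...
# run_all(retrieve_commands(55247))
```

The two manual steps after completion are exactly the part we would like to avoid.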

Have we missed something necessary in the upgrade? From the release notes I can't discern any such new requirement, and having to remember to manually retrieve output and remove completed jobs from the queue is a pain in the unmentionables.

Best regards,