
Re: [Condor-users] Not fully able to start jobs - permissions?



On May 12, 2005, at 3:29 PM, Rob Pieké wrote:

I'm having some weird problems where jobs aren't starting fully. The StarterLog file makes it look like the job is trying to start but then chokes. Specifically, it seems to be looking for log files that it can write to. The directory it's looking for doesn't exist, but the directory one level up is writable (i.e., that dir COULD be created if Condor wanted to do it). If I manually create the dir, Condor roars ahead, creates the logs, and runs the job.

Now, what's interesting to me (and maybe a clue to how to solve this) is that the same directory IS being created automatically on the master server. Is it possible that Condor assumes this directory is network accessible rather than per-machine? (I'm kinda grasping at straws here.)

Cheers!


5/11 11:35:41 ******************************************************
5/11 11:35:41 ** condor_starter (CONDOR_STARTER) STARTING UP
5/11 11:35:41 ** /mnt/pike/gorn/Applications/condor-6.6.9-linux_x86_64/sbin/condor_starter
5/11 11:35:41 ** $CondorVersion: 6.6.9 Mar 10 2005 $
5/11 11:35:41 ** $CondorPlatform: I386-LINUX_RH9 $
5/11 11:35:41 ** PID = 25629
5/11 11:35:41 ******************************************************
5/11 11:35:41 Using config file: /mnt/condor/accounts/condor/condor_config
5/11 11:35:41 Using local config files: /mnt/condor/accounts/condor/hosts/loaner1/condor_config.local
5/11 11:35:41 DaemonCore: Command Socket at <216.94.116.106:33946>
5/11 11:35:41 Done setting resource limits
5/11 11:35:41 Starter communicating with condor_shadow <216.94.116.89:49266>
5/11 11:35:41 Submitting machine is "tamari.coredp.com"
5/11 11:35:41 Starting a VANILLA universe job with ID: 33.0
5/11 11:35:41 IWD: /var/adm/condor/spool/cluster33.proc0.subproc0
5/11 11:35:41 Failed to open standard output file '/var/adm/condor/spool/cluster33.proc0.subproc0/condor.42811141-0.0.out': No such file or directory (errno 2)
5/11 11:35:41 Output file: /var/adm/condor/spool/cluster33.proc0.subproc0/condor.42811141-0.0.out
5/11 11:35:41 Failed to open standard error file '/var/adm/condor/spool/cluster33.proc0.subproc0/condor.42811141-0.0.error': No such file or directory (errno 2)
5/11 11:35:41 Error file: /var/adm/condor/spool/cluster33.proc0.subproc0/condor.42811141-0.0.error
5/11 11:35:41 Failed to open some/all of the std files...
5/11 11:35:41 Aborting OsProc::StartJob.
5/11 11:35:41 Failed to start job, exiting
5/11 11:35:41 ShutdownFast all jobs.
5/11 11:35:41 **** condor_starter (condor_STARTER) EXITING WITH STATUS 0

The starter on your execute machine is trying to open files in the spool directory of your submit machine. I'm guessing your pool is configured to have a shared filesystem and you submitted the job with the -r or -s argument to condor_submit.
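For illustration only (the schedd name below is just the submitting machine from your log, and the submit file name is made up), a remote submit looks something like this, and it is this style of submission that places the job's files under that schedd's SPOOL directory:

  # Hypothetical invocation: submit to a remote schedd by name; the job's
  # files are then spooled under that schedd's SPOOL directory on the
  # submit machine rather than left in your working directory.
  condor_submit -r tamari.coredp.com my_job.submit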


By default on Unix, if you tell Condor that you have a shared filesystem (by setting FILESYSTEM_DOMAIN), Condor assumes all of a job's files live on that shared filesystem, and the execute machine tries to open them directly. If you run condor_submit with -r or -s, all of the job's files are placed under the SPOOL directory on the submit machine (that is, the machine running the schedd you're submitting to). If that SPOOL directory isn't on the shared filesystem, the execute machine will fail to open the job's files.
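For reference, the shared-filesystem assumption comes from condor_config entries along these lines (the domain value here is only a placeholder, not taken from your pool):

  # Placeholder values -- machines advertising the same FILESYSTEM_DOMAIN
  # are assumed to see the same files at the same paths.
  FILESYSTEM_DOMAIN = coredp.com
  UID_DOMAIN        = coredp.com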

The easiest way to fix this is to set should_transfer_files to YES in your submit file. This tells Condor to always transfer a job's files between the submit and execute machines, rather than assume they're accessible via a shared filesystem.
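A minimal submit description file along those lines (executable and file names are made up) would be:

  # Sketch of a vanilla-universe submit file that forces file transfer.
  universe                = vanilla
  executable              = my_job
  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT
  output                  = my_job.out
  error                   = my_job.err
  log                     = my_job.log
  queue

If the job reads any input files, list them with transfer_input_files as well.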

+----------------------------------+---------------------------------+
|            Jaime Frey            |  Public Split on Whether        |
|        jfrey@xxxxxxxxxxx         |  Bush Is a Divider              |
|  http://www.cs.wisc.edu/~jfrey/  |         -- CNN Scrolling Banner |
+----------------------------------+---------------------------------+