
[Condor-users] Condor expects files to spool, even if I tell it not to?



Hi all,

As I baby-step through this process, I've now managed to set up my pool, get jobs running within it, and make several people happy. The next step is to submit a job to my pool from a remote host using GSI authentication. The GSI authentication itself works (see my previous e-mail, where I bungled around with the GRIDMAP macro).

Any help on my next step would be SUPER appreciated. Thanks in advance. Now, on to the issues I'm having. Feel free to point out dumb mistakes as well... like I said, I'm still learning here. ;)

Now, my problem. When submitting from the remote host, I issue the following command as a test:
	condor_submit -verbose -pool schedd-host -r schedd-host hostname.submit

hostname.submit looks like this. (I know the requirements are a bit odd, but they are there to work around MATLAB core dumping when run on i686 hosts via Condor, and to emulate a job run that handles data delivery internally.)

Universe        = vanilla
Executable      = /bin/hostname
Error           = hostname.err
Log             = hostname.log
GetEnv          = False
Arguments       = -f
Notification    = Error
should_transfer_files = IF_NEEDED
transfer_executable = False
copy_to_spool   = False
when_to_transfer_output = ON_EXIT
Requirements = (FileSystemDomain =!= "") && (Arch =!= "IA64") && (Memory >= ImageSize) && ((OpSys == "LINUX") || (OpSys == "SOLARIS29") || (OpSys == "SOLARIS5.10") ) && (Arch =!= "INTEL")
remote_universe = vanilla
+remote_ShouldTransferFiles = IN_NEEDED
+remote_TransferExecutable = False
+remote_WhenToTransferFiles = ON_EXIT
+remote_requirements = '(FileSystemDomain =!= "") && (Arch =!= "IA64") && (Memory >= ImageSize) && ((OpSys == "LINUX") || (OpSys == "SOLARIS29") || (OpSys == "SOLARIS5.10") ) && (Arch =!= "INTEL")'
+remote_copytospool = False
Queue
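
Since I asked for dumb mistakes to be pointed out, here is a minimal sketch of a sanity check for the file-transfer settings in a submit file like the one above. It assumes the allowed values are YES, NO and IF_NEEDED, as documented for should_transfer_files; applying the same set to the +remote_ShouldTransferFiles attribute is an assumption on my part. The helper names parse_submit and check_transfer_settings are hypothetical, not part of Condor. Run against the two relevant lines, it flags the IN_NEEDED value; whether that is related to the spooling behaviour is not established here.

```python
# Assumed allowed values, taken from the documented should_transfer_files settings.
ALLOWED_TRANSFER_VALUES = {"YES", "NO", "IF_NEEDED"}


def parse_submit(text):
    """Return (key, value) pairs for every 'key = value' line in a submit file."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and bare commands such as 'Queue'.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        pairs.append((key.strip(), value.strip()))
    return pairs


def check_transfer_settings(text):
    """Return the should-transfer-files style settings whose value is not allowed."""
    problems = []
    for key, value in parse_submit(text):
        # Treat should_transfer_files and +remote_ShouldTransferFiles alike.
        normalized = key.lstrip("+").lower().replace("_", "")
        if not normalized.endswith("shouldtransferfiles"):
            continue
        if value.strip('"').upper() not in ALLOWED_TRANSFER_VALUES:
            problems.append((key, value))
    return problems


# The two relevant lines from hostname.submit, copied as written.
SAMPLE = """\
should_transfer_files = IF_NEEDED
+remote_ShouldTransferFiles = IN_NEEDED
Queue
"""

for key, value in check_transfer_settings(SAMPLE):
    print(f"suspicious value: {key} = {value}")
```

The check is deliberately narrow: it only looks at keys ending in ShouldTransferFiles and does not validate anything else in the file.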


When I try to run this job, I get the following in the various log files on the schedd host I'm trying to submit to:

==> /opt/condor/local.divot/log/SchedLog <==
2/22 18:19:51 (pid:7249) DaemonCore: Command received via TCP from host <IP_ADDR:9677>
2/22 18:19:51 (pid:7249) DaemonCore: received command 488 (SPOOL_JOB_FILES_WITH_PERMS), calling handler (spoolJobFiles)
2/22 18:19:51 (pid:7735) Scheduler::spoolJobFilesWorkerThread(void *arg, Stream* s) NAP TIME
2/22 18:19:51 (pid:7249) DaemonCore: Command received via UDP from host <IP_ADDR:9637>
2/22 18:19:51 (pid:7249) DaemonCore: received command 421 (RESCHEDULE), calling handler (reschedule_negotiator)
2/22 18:19:51 (pid:7249) Sent ad to central manager for alathers@schedd-host
2/22 18:19:51 (pid:7249) Sent ad to 1 collectors for alathers@schedd-host
2/22 18:19:51 (pid:7249) Called reschedule_negotiator()
2/22 18:19:52 (pid:7249) Job 2722.0 released from hold: Data files spooled
2/22 18:19:52 (pid:7249) Called reschedule_negotiator()
2/22 18:19:56 (pid:7249) Sent ad to central manager for alathers@schedd-host
2/22 18:19:56 (pid:7249) Sent ad to 1 collectors for alathers@schedd-host
2/22 18:19:59 (pid:7249) Starting add_shadow_birthdate(2722.0)
2/22 18:19:59 (pid:7249) Started shadow for job 2722.0 on "<IP_ADDR:9652>", (shadow pid = 7737)
2/22 18:20:00 (pid:7249) Shadow pid 7737 for job 2722.0 exited with status 4
2/22 18:20:00 (pid:7249) ERROR: Shadow exited with job exception code!
2/22 18:20:01 (pid:7249) Sent ad to central manager for alathers@schedd-host
2/22 18:20:01 (pid:7249) Sent ad to 1 collectors for alathers@schedd-host
2/22 18:20:02 (pid:7249) Starting add_shadow_birthdate(2722.0)
2/22 18:20:02 (pid:7249) Started shadow for job 2722.0 on "<IP_ADDR:9652>", (shadow pid = 7738)


==> /opt/condor/local.divot/log/ShadowLog <==
2/22 18:19:59 ******************************************************
2/22 18:19:59 ** condor_shadow (CONDOR_SHADOW) STARTING UPschedd-host
2/22 18:19:59 ** /export/condor-6.7.13/sbin/condor_shadow
2/22 18:19:59 ** $CondorVersion: 6.7.13 Nov  7 2005 $
2/22 18:19:59 ** $CondorPlatform: I386-LINUX_RH9 $
2/22 18:19:59 ** PID = 7737
2/22 18:19:59 ******************************************************
2/22 18:19:59 Using config file: /export/condor/etc/condor_config
2/22 18:19:59 Using local config files: /export/condor-6.7.13/local.divot/condor_config.local
2/22 18:19:59 DaemonCore: Command Socket at <IP_ADDR:46242>
2/22 18:19:59 Initializing a VANILLA shadow for job 2722.0
2/22 18:19:59 (2722.0) (7737): Request to run on <IP_ADDR:9652> was ACCEPTED
2/22 18:20:00 (2722.0) (7737): ERROR "Error from starter on vm1@workernode: Failed to execute '/export/condor/local.divot/spool/cluster2722.proc0.subproc0/hostname condor_exec.exe -f': No such file or directory" at line 597 in file pseudo_ops.C


_______________________________________________________
Adam Lathers
NCMIR: National Center for Microscopy and Imaging Research
Distributed Systems Engineer
phone: (858) 534-7968
web:   http://ncmir.ucsd.edu