
Re: [Condor-users] SOAP API... Jobs don't run



Cargnelli, Matthieu wrote:

>Hi all,
>
>I'm trying to work with the SOAP API for Condor. I use the 
>SOAPScheddApiHelper from 
>http://www.cs.wisc.edu/condor/birdbath/SOAPScheddApiHelper.java
>When I submit a job, it is queued correctly but never runs. If I try to 
>delete a job, it never seems to be removed from the queue completely.
>  
>
Hi again,

It seems that my original problem has evolved a little. My pool is still a 
single machine (my own), so everything should be as simple as possible.
Now, when I submit a job that transfers 2 files, the files show up in the 
spool directory as expected, but the logs look odd:

SchedLog:7/26 16:36:39 Activity on stashed negotiator socket
SchedLog:7/26 16:36:39 Negotiating for owner: 
condor@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:36:39 Checking consistency running and runnable jobs
SchedLog:7/26 16:36:39 Tables are consistent
SchedLog:7/26 16:36:39 Out of jobs - 1 jobs matched, 0 jobs idle, flock 
level = 0
SchedLog:7/26 16:36:39 Sent ad to central manager for 
nobody@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:36:39 Sent ad to 1 collectors for 
nobody@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:36:39 Sent ad to central manager for 
condor@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:36:39 Sent ad to 1 collectors for 
condor@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:36:43 Starting add_shadow_birthdate(20.0)
SchedLog:7/26 16:36:43 Started shadow for job 20.0 on 
"<10.251.147.33:56110>", (shadow pid = 4937)
SchedLog:7/26 16:36:44 Sent ad to central manager for 
nobody@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:36:44 Sent ad to 1 collectors for 
nobody@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:36:44 Sent ad to central manager for 
condor@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:36:44 Sent ad to 1 collectors for 
condor@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:36:44 Shadow pid 4937 for job 20.0 exited with status 100
SchedLog:7/26 16:36:44 contraint ((NiceUser == FALSE) && 
((AccountingGroup =?= "condor") || (AccountingGroup =?= UNDEFINED && 
Owner =?= "condor"))) does not
evaluate to bool
SchedLog:7/26 16:36:44 contraint ((NiceUser == FALSE) && 
((AccountingGroup =?= "condor") || (AccountingGroup =?= UNDEFINED && 
Owner =?= "condor"))) does not
evaluate to bool
SchedLog:7/26 16:36:44 contraint ((NiceUser == FALSE) && 
((AccountingGroup =?= "condor") || (AccountingGroup =?= UNDEFINED && 
Owner =?= "condor"))) does not
evaluate to bool
SchedLog:7/26 16:36:44 contraint ((NiceUser == FALSE) && 
((AccountingGroup =?= "condor") || (AccountingGroup =?= UNDEFINED && 
Owner =?= "condor"))) does not
evaluate to bool
SchedLog:7/26 16:36:44 contraint ((NiceUser == FALSE) && 
((AccountingGroup =?= "condor") || (AccountingGroup =?= UNDEFINED && 
Owner =?= "condor"))) does not
evaluate to bool
SchedLog:7/26 16:36:44 contraint ((NiceUser == FALSE) && 
((AccountingGroup =?= "condor") || (AccountingGroup =?= UNDEFINED && 
Owner =?= "condor"))) does not
evaluate to bool
SchedLog:7/26 16:36:44 contraint ((NiceUser == FALSE) && 
((AccountingGroup =?= "condor") || (AccountingGroup =?= UNDEFINED && 
Owner =?= "condor"))) does not
evaluate to bool
SchedLog:7/26 16:36:44 contraint ((NiceUser == FALSE) && 
((AccountingGroup =?= "condor") || (AccountingGroup =?= UNDEFINED && 
Owner =?= "condor"))) does not
evaluate to bool
SchedLog:7/26 16:36:44 contraint ((NiceUser == FALSE) && 
((AccountingGroup =?= "condor") || (AccountingGroup =?= UNDEFINED && 
Owner =?= "condor"))) does not
evaluate to bool
SchedLog:7/26 16:36:44 match (<10.251.147.33:56110>#1122044962#540) out 
of jobs (cluster id 20); relinquishing
SchedLog:7/26 16:36:44 Sent RELEASE_CLAIM to startd on 
<10.251.147.33:56110>
SchedLog:7/26 16:36:44 Match record (<10.251.147.33:56110>, 20, -1) deleted
SchedLog:7/26 16:36:44 DaemonCore: Command received via TCP from host 
<10.251.147.33:34130>
SchedLog:7/26 16:36:44 DaemonCore: received command 443 
(VACATE_SERVICE), calling handler (vacate_service)
SchedLog:7/26 16:36:44 Got VACATE_SERVICE from <10.251.147.33:34130>
SchedLog:7/26 16:41:44 Sent ad to central manager for 
nobody@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:41:44 Sent ad to 1 collectors for 
nobody@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:41:44 Sent ad to central manager for 
condor@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:41:44 Sent ad to 1 collectors for 
condor@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:46:44 Sent ad to central manager for 
nobody@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:46:44 Sent ad to 1 collectors for 
nobody@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:46:44 Sent ad to central manager for 
condor@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:46:44 Sent ad to 1 collectors for 
condor@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:51:44 Sent ad to central manager for 
nobody@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:51:44 Sent ad to 1 collectors for 
nobody@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:51:44 Sent ad to central manager for 
condor@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:51:44 Sent ad to 1 collectors for 
condor@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:56:44 Sent ad to central manager for 
nobody@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:56:44 Sent ad to 1 collectors for 
nobody@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:56:44 Sent ad to central manager for 
condor@xxxxxxxxxxxxxxxxxxxxxxx
SchedLog:7/26 16:56:44 Sent ad to 1 collectors for 
condor@xxxxxxxxxxxxxxxxxxxxxxx

StartLog:7/26 16:36:43 DaemonCore: Command received via TCP from host 
<10.251.147.33:34114>
StartLog:7/26 16:36:43 DaemonCore: received command 444 
(ACTIVATE_CLAIM), calling handler (command_activate_claim)
StartLog:7/26 16:36:43 vm1: Got activate_claim request from shadow 
(<10.251.147.33:34114>)
StartLog:7/26 16:36:43 vm1: Remote job ID is 20.0
StartLog:7/26 16:36:43 vm1: Got universe "VANILLA" (5) from request classad
StartLog:7/26 16:36:43 vm1: State change: claim-activation protocol 
successful
StartLog:7/26 16:36:43 vm1: Changing activity: Idle -> Busy
StartLog:7/26 16:36:44 DaemonCore: Command received via TCP from host 
<10.251.147.33:34127>
StartLog:7/26 16:36:44 DaemonCore: received command 404 
(DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)
StartLog:7/26 16:36:44 vm1: Called deactivate_claim_forcibly()
StartLog:7/26 16:36:44 Starter pid 4938 exited with status 0
StartLog:7/26 16:36:44 vm1: State change: starter exited
StartLog:7/26 16:36:44 vm1: Changing activity: Busy -> Idle
StartLog:7/26 16:36:44 DaemonCore: Command received via UDP from host 
<10.251.147.33:33650>
StartLog:7/26 16:36:44 DaemonCore: received command 443 (RELEASE_CLAIM), 
calling handler (command_release_claim)
StartLog:7/26 16:36:44 vm1: State change: received RELEASE_CLAIM command
StartLog:7/26 16:36:44 vm1: Changing state and activity: Claimed/Idle -> 
Preempting/Vacating
StartLog:7/26 16:36:44 vm1: State change: No preempting claim, returning 
to owner
StartLog:7/26 16:36:44 vm1: Changing state and activity: 
Preempting/Vacating -> Owner/Idle
StartLog:7/26 16:36:44 vm1: State change: IS_OWNER is false
StartLog:7/26 16:36:44 vm1: Changing state: Owner -> Unclaimed
StartLog:7/26 16:36:44 DaemonCore: Command received via UDP from host 
<10.251.147.33:33650>
StartLog:7/26 16:36:44 DaemonCore: received command 443 (RELEASE_CLAIM), 
calling handler (command_release_claim)
StartLog:7/26 16:36:44 Error: can't find resource with ClaimId 
(<10.251.147.33:56110>#1122044962#540)

StarterLog.vm1:7/26 16:36:43 Using config file: 
/opt/condor-6.7.8/etc/condor_config
StarterLog.vm1:7/26 16:36:43 Using local config files: 
/opt/condor-6.7.8/local.patrouille/condor_config.local
StarterLog.vm1:7/26 16:36:43 DaemonCore: Command Socket at 
<10.251.147.33:34115>
StarterLog.vm1:7/26 16:36:43 Done setting resource limits
StarterLog.vm1:7/26 16:36:43 Communicating with shadow 
<10.251.147.33:34113>
StarterLog.vm1:7/26 16:36:43 Submitting machine is 
"patrouille.grideads.net"
StarterLog.vm1:7/26 16:36:43 File transfer completed successfully.
StarterLog.vm1:7/26 16:36:44 Starting a VANILLA universe job with ID: 20.0
StarterLog.vm1:7/26 16:36:44 IWD: 
/opt/condor-6.7.8/local.patrouille/execute/dir_4938
StarterLog.vm1:7/26 16:36:44 About to exec 
/opt/condor-6.7.8/local.patrouille/execute/dir_4938/condor_exec.exe toto 
TRUE
StarterLog.vm1:7/26 16:36:44 Create_Process succeeded, pid=4940
StarterLog.vm1:7/26 16:36:44 Process exited, pid=4940, status=0
StarterLog.vm1:7/26 16:36:44 Got SIGQUIT.  Performing fast shutdown.
StarterLog.vm1:7/26 16:36:44 ShutdownFast all jobs.
StarterLog.vm1:7/26 16:36:44 **** condor_starter (condor_STARTER) 
EXITING WITH STATUS 0
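
To read the three excerpts above as a single timeline, a small Python sketch along these lines could merge them by timestamp. It assumes every entry starts with the "LogName:M/D HH:MM:SS" prefix seen above and that any line without that prefix is a mail-wrapped continuation of the previous entry. The name merge_logs is only illustrative and is not part of Condor.

```python
import re

# Assumed entry prefix, taken from the excerpts above:
#   <LogName>:<month>/<day> <HH>:<MM>:<SS> <message>
LINE_RE = re.compile(
    r"^(?P<log>[\w.]+):(?P<mon>\d+)/(?P<day>\d+) "
    r"(?P<h>\d+):(?P<m>\d+):(?P<s>\d+) (?P<msg>.*)$"
)


def merge_logs(text):
    """Return (time_key, log_name, message) tuples sorted by time.

    time_key is (month, day, hour, minute, second). Lines that do not
    carry the prefix are appended to the previous entry, on the
    assumption that they are wrapped continuations.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = LINE_RE.match(line)
        if match:
            key = tuple(int(match[name]) for name in ("mon", "day", "h", "m", "s"))
            entries.append([key, match["log"], match["msg"].strip()])
        elif entries:
            # Hypothetical handling of mail-wrapped lines: rejoin with a space.
            entries[-1][2] += " " + line
    # sorted() is stable, so entries within the same second keep input order.
    return [tuple(entry) for entry in sorted(entries, key=lambda e: e[0])]
```

Feeding it the pasted excerpts and printing each tuple would show the shadow, startd and starter events for job 20.0 interleaved in time order, which makes it easier to see what happened between 16:36:43 and 16:36:44.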

Has anyone seen this kind of problem before?

Regards,

Matthieu Cargnelli