
Re: [Condor-users] Condor jobs get matched, then released immediately




Take a look in the ShadowLog on the submit machine or in the StarterLog on the execute machine --- perhaps grep -i for "error".
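If grep is not handy (on a Windows node, say), the same search can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the log path in the usage comment is a placeholder. On a real install, `condor_config_val LOG` should report the directory that holds ShadowLog and StarterLog.

```python
import re

# Case-insensitive match, the equivalent of `grep -i error`.
ERROR_RE = re.compile(r"error", re.IGNORECASE)

def find_error_lines(lines):
    """Return (line_number, text) pairs for every line mentioning 'error'."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if ERROR_RE.search(line)
    ]

def scan_log(path):
    """Scan one log file; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as handle:
        return find_error_lines(handle)

# Example usage (the path is a placeholder, not a real default):
# for number, text in scan_log("/path/to/condor/log/ShadowLog"):
#     print(number, text)
```

Running it against both logs around the timestamp where the claim is released should narrow things down quickly.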

One guess is that something the job needs immediately at startup is missing, such as the specified initial working directory or stdin file. Condor (in v6.8.x) will automatically try to restart the job, just in case the missing files or directories are on a file server that is temporarily down. In v6.9.x, several errors of this sort will result in the job being retried a couple of times and then placed on hold (with a hold reason).
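A quick way to test that guess before resubmitting is to check that the paths named in the submit file actually exist. The sketch below is an assumption-laden helper, not part of Condor: `missing_startup_files` is a made-up name, and it takes the values you gave for `initialdir` and `input` in the submit description file. Run it on whichever machine needs to see those paths.

```python
import os

def missing_startup_files(initialdir, stdin_file=None):
    """Return a list of problems found; an empty list means nothing is missing.

    initialdir -- the job's initial working directory
    stdin_file -- the stdin file, absolute or relative to initialdir (optional)
    """
    problems = []
    if not os.path.isdir(initialdir):
        problems.append(f"initial working directory not found: {initialdir}")
        return problems  # relative paths cannot be checked without the directory
    if stdin_file:
        if os.path.isabs(stdin_file):
            path = stdin_file
        else:
            path = os.path.join(initialdir, stdin_file)
        if not os.path.isfile(path):
            problems.append(f"stdin file not found: {path}")
    return problems

# Example usage (both values are placeholders):
# for problem in missing_startup_files("/path/to/job/dir", "input.txt"):
#     print(problem)
```

If this reports nothing, the logs are still the place to look; the missing item may be something else the job needs at startup.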

Hope this helps,
Todd




Ngwa Godlove wrote:


Hi,

I’m new to Condor and recently installed 6.8.5 on a new pool with 4 nodes, 1 pool manager and 1 submitter. Every time I submit a job, condor_status shows all nodes as claimed, and then they all immediately switch back to unclaimed. condor_reschedule does the same thing, with the nodes going from claimed to unclaimed.

I’m tempted to think the origin of my problems is my Condor configuration. Below is part of the StartLog from one of my nodes. Can anyone tell what is wrong from this log? Any ideas are greatly appreciated.

6/25 14:39:20 DaemonCore: Command received via TCP from host <X.X.X.125:3844>

6/25 14:39:20 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)

6/25 14:39:20 vm1: Request accepted.

6/25 14:39:20 vm1: Remote owner is BBBBBBB

6/25 14:39:20 vm1: State change: claiming protocol successful

6/25 14:39:20 vm1: Changing state: Unclaimed -> Claimed

6/25 14:39:28 DaemonCore: Command received via TCP from host <X.X.X.125:3876>

6/25 14:39:28 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)

6/25 14:39:28 vm1: Got activate_claim request from shadow (<10.0.0.125:3876>)

6/25 14:39:28 vm1: Remote job ID is 11.3

6/25 14:39:28 vm1: Got universe "VANILLA" (5) from request classad

6/25 14:39:28 vm1: State change: claim-activation protocol successful

6/25 14:39:28 vm1: Changing activity: Idle -> Busy

6/25 14:39:34 DaemonCore: Command received via TCP from host <X.X.X.125:3901>

6/25 14:39:34 DaemonCore: received command 404 (DEACTIVATE_CLAIM_FORCIBLY), calling handler (command_handler)

6/25 14:39:34 vm1: Called deactivate_claim_forcibly()

6/25 14:39:34 DaemonCore: Command received via UDP from host <X.X.X.125:3904>

6/25 14:39:34 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)

6/25 14:39:34 vm1: State change: received RELEASE_CLAIM command

6/25 14:39:34 vm1: Changing state and activity: Claimed/Busy -> Preempting/Vacating

6/25 14:39:34 DaemonCore: Command received via UDP from host <X.X.X.125:3905>

6/25 14:39:34 DaemonCore: received command 443 (RELEASE_CLAIM), calling handler (command_release_claim)

6/25 14:39:34 vm1: Got RELEASE_CLAIM while in Preempting state, ignoring.

6/25 14:39:34 DaemonCore: Command received via UDP from host <X.X.X.123:3738>

6/25 14:39:34 DaemonCore: received command 60011 (DC_NOP), calling handler (handle_nop())

6/25 14:39:34 Starter pid 2508 exited with status 0

6/25 14:39:34 vm1: State change: starter exited

6/25 14:39:34 vm1: State change: No preempting claim, returning to owner

6/25 14:39:34 vm1: Changing state and activity: Preempting/Vacating -> Owner/Idle

6/25 14:39:34 vm1: State change: IS_OWNER is false

6/25 14:39:34 vm1: Changing state: Owner -> Unclaimed

** Godlove Ntumngia **

** Axis GeoSpatial LLC **

