
[Condor-users] "Bad CONDOR_JOB_STATUS_CONSTRAINED Result" [Sec=Unclassified]



I posted this problem recently but got no resolution, possibly because
the message was formatted as HTML.

Anyway,

I upgraded our Condor config to Condor-C to address problems for users
submitting from semi-permanent laptops, and now they are unable to
receive their results.

Condor-C jobs are being submitted and executed, and show as status
Completed on the remote Linux central manager, but are not being returned
to the Windows submit machines. The gridmanager keeps returning the
following error:
 
6/26 08:56:10 [1556] ERROR "Bad CONDOR_JOB_STATUS_CONSTRAINED Result" at line 3808 in file ..\src\condor_gridmanager\gahp-client.C
 
I also keep ending up with a core.C_GAHP.WIN32 core dump from the GAHP
server. 

All 5 execute machines are Linux, dedicated, and share the same config.
Submitters are Windows XP.  Central Manager is Linux.
We are using 6.9.3; not sure if upgrading might help?

With further testing it appears to happen when I submit more than one
job.  If I submit one job and wait, the results are returned.  If I
submit more than one, they all just sit on the central manager in a
Completed state.  If I then remove the first job, the rest are
returned as expected.

In the gridmanager log below, the job was sitting Completed on the
central manager and the gridmanager just cycled through the first 15 or
so lines repeatedly.  I then did a condor_rm on the job and the error
was thrown.  A second job I had submitted, which was also sitting
Completed and apparently blocked by the first, was then returned.

6/27 14:57:47 [472] Using constraint ((Owner=?="troy_rob"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
6/27 14:57:47 [472] Fetched 0 job ads from schedd
6/27 14:57:47 [472] leaving doContactSchedd()
6/27 14:57:49 [472] *** UpdateLeases called
6/27 14:57:49 [472]     UpdateLeases: calc'ing new leases
6/27 14:57:50 [472] GAHP[4012] <- 'RESULTS'
6/27 14:57:50 [472] GAHP[4012] -> 'S' '0'
6/27 14:58:19 [472] *** UpdateLeases called
6/27 14:58:19 [472]     UpdateLeases: calc'ing new leases
6/27 14:58:47 [472] Getting monitoring info for pid 472
6/27 14:58:47 [472] Received CHECK_LEASES signal
6/27 14:58:47 [472] in doContactSchedd()
6/27 14:58:47 [472] ZKM: setting default map to (null)
6/27 14:58:47 [472] querying for renewed leases
6/27 14:58:47 [472] querying for removed/held jobs
6/27 14:58:47 [472] Using constraint ((Owner=?="troy_rob"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
6/27 14:58:47 [472] Fetched 0 job ads from schedd
6/27 14:58:47 [472] leaving doContactSchedd()
6/27 14:58:49 [472] *** UpdateLeases called
6/27 14:58:49 [472]     UpdateLeases: calc'ing new leases
6/27 14:58:50 [472] GAHP[4012] <- 'RESULTS'
6/27 14:58:50 [472] GAHP[4012] -> 'S' '0'
6/27 14:59:13 [472] Received REMOVE_JOBS signal
6/27 14:59:13 [472] in doContactSchedd()
6/27 14:59:13 [472] ZKM: setting default map to (null)
6/27 14:59:13 [472] querying for new jobs
6/27 14:59:13 [472] Using constraint ((Owner=?="troy_rob"&&JobUniverse==9)) && (Managed =!= "ScheddDone") && (Matched =!= FALSE) && (JobStatus != 5) && (Managed =!= "External")
6/27 14:59:13 [472] Fetched 0 new job ads from schedd
6/27 14:59:13 [472] querying for removed/held jobs
6/27 14:59:13 [472] Using constraint ((Owner=?="troy_rob"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
6/27 14:59:13 [472] Fetched 1 job ads from schedd
6/27 14:59:13 [472] leaving doContactSchedd()
6/27 14:59:13 [472] (107.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 1
6/27 14:59:13 [472] (107.0) gm state change: GM_SUBMITTED -> GM_CANCEL
6/27 14:59:13 [472] GAHP[4012] <- 'CONDOR_JOB_REMOVE 11 erm-43880.xxx.yyy.au 69.0 by\ gridmanager'
6/27 14:59:13 [472] GAHP[4012] -> 'S'
6/27 14:59:19 [472] *** UpdateLeases called
6/27 14:59:19 [472]     UpdateLeases: calc'ing new leases
6/27 14:59:30 [472] GAHP[4012] <- 'RESULTS'
6/27 14:59:30 [472] GAHP[4012] -> 'R'
6/27 14:59:30 [472] GAHP[4012] -> 'S' '1'
6/27 14:59:30 [472] GAHP[4012] -> '10' 'S' 'NULL' '2' '[ MyType = "Job";
TargetType = "Machine"; GlobalJobId =
"ERM-43880.xxx.yyy.au#1214541518#69.0"; NTDomain = "ANTDIV"; MinHosts =
1; MaxHosts = 1; WantRemoteSyscalls = FALSE; WantCheckpoint = FALSE;
JobPrio = 0; NiceUser = FALSE; WantRemoteIO = TRUE; CoreSize = 4856408;
Rank = 0.000000; In = "/dev/null"; TransferIn = FALSE; Out =
"Output_107.0.txt"; StreamOut = FALSE; Err = "Error_107.0.txt";
StreamErr = FALSE; BufferSize = 524288; BufferBlockSize = 32768;
TransferInput = "estimation.csl,population.csl,output.csl,mpd.dat";
ExecutableSize_RAW = 10000; ExecutableSize = 10000; JobUniverse = 5;
QDate = 1214542010; LocalUserCpu = 0.000000; LocalSysCpu = 0.000000;
ExitStatus = 0; NumCkpts_RAW = 0; NumCkpts = 0; NumRestarts = 0;
NumSystemHolds = 0; CommittedTime = 0; TotalSuspensions = 0;
CumulativeSuspensionTime = 0; JobNotification = 0; LeaveJobInQueue =
JobStatus == 4; User = "troy_rob@xxxxxxxxxxxxxxxxxxxx"; Owner =
"troy_rob"; PeriodicRemove = (StageInFinish > 0) =!= TRUE && CurrentTime
> QDate + 28800; SubmitterId = "SCI-47798.XXX.YYY.AU"; requirements =
Arch == "X86_64" && OpSys == "LINUX"; universe = vanilla;
shouldtransferfiles = "YES"; whentotransferoutput = "ON_EXIT"; Arguments
= "-e -O mpd.dat"; Environment = ""; ClusterId = 69; ProcId = 0;
PeriodicHold = FALSE; PeriodicRelease = FALSE; OnExitHold = FALSE;
OnExitRemove = TRUE; StageInStart = 1214541538; SUBMIT_Iwd =
"C:\\Stuff\\CASAL"; Iwd =
"/opt/condor-6.9.3/local.ERM-43880/spool/cluster69.proc0.subproc0";
TransferOutputRemaps = UNDEFINED; SUBMIT_Cmd =
"C:\\Stuff\\CASAL\\casal"; Cmd =
"/opt/condor-6.9.3/local.ERM-43880/spool/cluster69.proc0.subproc0/casal"
; StageInFinish = 1214541539; ReleaseReason = "Data files spooled";
LastHoldReason = "Spooling input data files"; AutoClusterId = 0;
AutoClusterAttrs =
"JobUniverse,LastCheckpointPlatform,NumCkpts,Requirements,NiceUser";
WantMatchDiagnostics = TRUE; LastMatchTime = 1214541600; NumJobMatches =
1; OrigMaxHosts = 1; JobStartDate = 1214541604; JobCurrentStartDate =
1214541604; JobRunCount = 1; TransferFiles = "NEVER"; DiskUsage_RAW =
8798; DiskUsage = 10000; LastJobLeaseRenewal = 1214541696; RemoteSysCpu
= 9.000000; RemoteUserCpu = 77.000000; ImageSize_RAW = 133276; ImageSize
= 140000; ExitBySignal = FALSE; ExitCode = 0; TerminationPending = TRUE;
BytesSent = 4359119.000000; BytesRecvd = 8618658.000000; JobStatus = 4;
EnteredCurrentStatus = 1214541697; LastSuspensionTime = 0;
RemoteWallClockTime = 93.000000; LastRemoteHost =
"slot2@xxxxxxxxxxxxxxxxxxxx"; LastPublicClaimId =
"<147.66.12.120:52508>#1212646565#857#..."; LastPublicClaimIds = "";
CurrentHosts = 0; CompletionDate = 1214541697; JobFinishedHookDone =
1214541697; ServerTime = 1214541818; MyType = "Job"; TargetType =
"Machine"; ]' '[ MyType = "Job"; TargetType = "Machine"; GlobalJobId =
"ERM-43880.xxx.yyy.au#1214541518#70.0"; NTDomain = "ANTDIV"; MinHosts =
1; MaxHosts = 1; WantRemoteSyscalls = FALSE; WantCheckpoint = FALSE;
JobPrio = 0; NiceUser = FALSE; WantRemoteIO = TRUE; CoreSize = 4856408;
Rank = 0.000000; In = "/dev/null"; TransferIn = FALSE; Out =
"Output_108.0.txt"; StreamOut = FALSE; Err = "Error_108.0.txt";
StreamErr = FALSE; BufferSize = 524288; BufferBlockSize = 32768;
TransferInput = "estimation.csl,population.csl,output.csl,mpd.dat";
ExecutableSize_RAW = 10000; ExecutableSize = 10000; JobUniverse = 5;
QDate = 1214542014; LocalUserCpu = 0.000000; LocalSysCpu = 0.000000;
ExitStatus = 0; NumCkpts_RAW = 0; NumCkpts = 0; NumRestarts = 0;
NumSystemHolds = 0; CommittedTime = 0; TotalSuspensions = 0;
CumulativeSuspensionTime = 0; JobNotification = 0; LeaveJobInQueue =
JobStatus == 4; User = "troy_rob@ERM-411' 'S' 'NULL'
6/27 14:59:30 [472] ERROR "Bad CONDOR_JOB_STATUS_CONSTRAINED Result" at line 3808 in file .\src\condor_gridmanager\gahp-client.C
6/27 14:59:52 WARNING: Config source is empty: C:\condor/condor_config.local
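
In case it helps anyone reading the log above: pasted gridmanager logs
tend to get re-wrapped by the mail client, so several entries can end up
on one line. Below is a rough Python sketch (not part of Condor) that
splits such text back into one entry per timestamp and flags any ClassAd
argument in a GAHP result that does not end with a closing "]". That is
what the second ad in the log looks like just before the "Bad
CONDOR_JOB_STATUS_CONSTRAINED Result" error. The names split_entries and
truncated_ads are made up for this example, SAMPLE is an abbreviated
stand-in for the real log, and the patterns are only assumptions based
on the entries shown here.

```python
import re

# Assumption: every entry starts with "M/D HH:MM:SS " as in the log above.
# The pattern is zero-width, so splitting keeps the timestamp with its entry.
ENTRY_START = re.compile(r"(?<!\S)(?=\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2} )")

# Assumption: ClassAd arguments are logged as '[ ... ]' in single quotes.
AD_ARG = re.compile(r"'(\[.*?)'(?=\s|$)", re.DOTALL)


def split_entries(wrapped_text):
    """Join re-wrapped lines, then cut the text before each timestamp."""
    flat = " ".join(
        line.strip() for line in wrapped_text.splitlines() if line.strip()
    )
    return [part.strip() for part in ENTRY_START.split(flat) if part.strip()]


def truncated_ads(entry):
    """Return the ClassAd arguments in one entry that lack a closing ']'."""
    return [ad for ad in AD_ARG.findall(entry) if not ad.rstrip().endswith("]")]


# Abbreviated, re-wrapped stand-in for the log above (not the full output).
SAMPLE = """\
6/27 14:59:13 [472] Fetched 1 job ads from schedd 6/27
14:59:13 [472] leaving doContactSchedd() 6/27 14:59:30 [472]
GAHP[4012] -> '10' 'S' 'NULL' '2' '[ MyType = "Job"; ]' '[ MyType = "Job";
User = "troy_rob@ERM-411' 'S' 'NULL' 6/27 14:59:30 [472]
ERROR "Bad CONDOR_JOB_STATUS_CONSTRAINED Result" at line 3808
"""

if __name__ == "__main__":
    for entry in split_entries(SAMPLE):
        marker = "TRUNCATED AD" if truncated_ads(entry) else "ok"
        print(f"[{marker}] {entry[:70]}")
```

Run against the full GridmanagerLog, this only points at entries worth a
closer look. It does not say why the GAHP server returned a cut-off ad.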


Troy
___________________________________________________________________________

    Australian Antarctic Division - Commonwealth of Australia
        Visit our web site at http://www.antarctica.gov.au/
___________________________________________________________________________