
Re: [Condor-users] "Bad CONDOR_JOB_STATUS_CONSTRAINED Result" [Sec=Unclassified]



Hi Jaime,

Thanks for your interest in the problem. We are using 6.9.3; do you think upgrading might help?

With further testing, it appears to happen when I submit more than one job. If I submit one job and wait, the results are returned. If I submit more than one, they sit on the central manager in a Completed state. If I then remove the first job, the rest are returned.

All 5 execute machines have the same config and are dedicated.

In the gridmanager log below, the job was sitting Completed on the central manager and the gridmanager just cycled through the first 15 or so lines repeatedly. I then did a condor_rm on the job and the error was thrown. A second job I had submitted, which was also sitting Completed and apparently blocked by the first, was then returned.
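To make the sequence above concrete, here is roughly what I am doing (a sketch only; the submit file name and the cluster ID are placeholders, not our real ones):

```shell
# Submit two or more Condor-C jobs (job.sub is a placeholder submit file)
condor_submit job.sub
condor_submit job.sub

# On the central manager, the jobs eventually show JobStatus == 4 (Completed)
# but their output is never returned to the Windows submit machine:
condor_q -constraint 'JobStatus == 4'

# Removing the first stuck job (107.0 is a placeholder ID) releases the
# others, which are then returned -- and triggers the gridmanager ERROR:
condor_rm 107.0
```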

6/27 14:57:47 [472] Using constraint ((Owner=?="troy_rob"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
6/27 14:57:47 [472] Fetched 0 job ads from schedd
6/27 14:57:47 [472] leaving doContactSchedd()
6/27 14:57:49 [472] *** UpdateLeases called
6/27 14:57:49 [472]     UpdateLeases: calc'ing new leases
6/27 14:57:50 [472] GAHP[4012] <- 'RESULTS'
6/27 14:57:50 [472] GAHP[4012] -> 'S' '0'
6/27 14:58:19 [472] *** UpdateLeases called
6/27 14:58:19 [472]     UpdateLeases: calc'ing new leases
6/27 14:58:47 [472] Getting monitoring info for pid 472
6/27 14:58:47 [472] Received CHECK_LEASES signal
6/27 14:58:47 [472] in doContactSchedd()
6/27 14:58:47 [472] ZKM: setting default map to (null)
6/27 14:58:47 [472] querying for renewed leases
6/27 14:58:47 [472] querying for removed/held jobs
6/27 14:58:47 [472] Using constraint ((Owner=?="troy_rob"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
6/27 14:58:47 [472] Fetched 0 job ads from schedd
6/27 14:58:47 [472] leaving doContactSchedd()
6/27 14:58:49 [472] *** UpdateLeases called
6/27 14:58:49 [472]     UpdateLeases: calc'ing new leases
6/27 14:58:50 [472] GAHP[4012] <- 'RESULTS'
6/27 14:58:50 [472] GAHP[4012] -> 'S' '0'
6/27 14:59:13 [472] Received REMOVE_JOBS signal
6/27 14:59:13 [472] in doContactSchedd()
6/27 14:59:13 [472] ZKM: setting default map to (null)
6/27 14:59:13 [472] querying for new jobs
6/27 14:59:13 [472] Using constraint ((Owner=?="troy_rob"&&JobUniverse==9)) && (Managed =!= "ScheddDone") && (Matched =!= FALSE) && (JobStatus != 5) && (Managed =!= "External")
6/27 14:59:13 [472] Fetched 0 new job ads from schedd
6/27 14:59:13 [472] querying for removed/held jobs
6/27 14:59:13 [472] Using constraint ((Owner=?="troy_rob"&&JobUniverse==9)) && ((Managed =!= "ScheddDone")) && (JobStatus == 3 || JobStatus == 4 || (JobStatus == 5 && Managed =?= "External"))
6/27 14:59:13 [472] Fetched 1 job ads from schedd
6/27 14:59:13 [472] leaving doContactSchedd()
6/27 14:59:13 [472] (107.0) doEvaluateState called: gmState GM_SUBMITTED, remoteState 1
6/27 14:59:13 [472] (107.0) gm state change: GM_SUBMITTED -> GM_CANCEL
6/27 14:59:13 [472] GAHP[4012] <- 'CONDOR_JOB_REMOVE 11 erm-43880.xxx.yyy.au 69.0 by\ gridmanager'
6/27 14:59:13 [472] GAHP[4012] -> 'S'
6/27 14:59:19 [472] *** UpdateLeases called
6/27 14:59:19 [472]     UpdateLeases: calc'ing new leases
6/27 14:59:30 [472] GAHP[4012] <- 'RESULTS'
6/27 14:59:30 [472] GAHP[4012] -> 'R'
6/27 14:59:30 [472] GAHP[4012] -> 'S' '1'
6/27 14:59:30 [472] GAHP[4012] -> '10' 'S' 'NULL' '2' '[ MyType = "Job"; TargetType = "Machine"; GlobalJobId = "ERM-43880.xxx.yyy.au#1214541518#69.0"; NTDomain = "ANTDIV"; MinHosts = 1; MaxHosts = 1; WantRemoteSyscalls = FALSE; WantCheckpoint = FALSE; JobPrio = 0; NiceUser = FALSE; WantRemoteIO = TRUE; CoreSize = 4856408; Rank = 0.000000; In = "/dev/null"; TransferIn = FALSE; Out = "Output_107.0.txt"; StreamOut = FALSE; Err = "Error_107.0.txt"; StreamErr = FALSE; BufferSize = 524288; BufferBlockSize = 32768; TransferInput = "estimation.csl,population.csl,output.csl,mpd.dat"; ExecutableSize_RAW = 10000; ExecutableSize = 10000; JobUniverse = 5; QDate = 1214542010; LocalUserCpu = 0.000000; LocalSysCpu = 0.000000; ExitStatus = 0; NumCkpts_RAW = 0; NumCkpts = 0; NumRestarts = 0; NumSystemHolds = 0; CommittedTime = 0; TotalSuspensions = 0; CumulativeSuspensionTime = 0; JobNotification = 0; LeaveJobInQueue = JobStatus == 4; User = "troy_rob@xxxxxxxxxxxxxxxxxxxx"; Owner = "troy_rob"; PeriodicRemove = (StageInFinish > 0) =!= TRUE && CurrentTime > QDate + 28800; SubmitterId = "SCI-47798.XXX.YYY.AU"; requirements = Arch == "X86_64" && OpSys == "LINUX"; universe = vanilla; shouldtransferfiles = "YES"; whentotransferoutput = "ON_EXIT"; Arguments = "-e -O mpd.dat"; Environment = ""; ClusterId = 69; ProcId = 0; PeriodicHold = FALSE; PeriodicRelease = FALSE; OnExitHold = FALSE; OnExitRemove = TRUE; StageInStart = 1214541538; SUBMIT_Iwd = "C:\\Stuff\\CASAL"; Iwd = "/opt/condor-6.9.3/local.ERM-43880/spool/cluster69.proc0.subproc0"; TransferOutputRemaps = UNDEFINED; SUBMIT_Cmd = "C:\\Stuff\\CASAL\\casal"; Cmd = "/opt/condor-6.9.3/local.ERM-43880/spool/cluster69.proc0.subproc0/casal"; StageInFinish = 1214541539; ReleaseReason = "Data files spooled"; LastHoldReason = "Spooling input data files"; AutoClusterId = 0; AutoClusterAttrs = "JobUniverse,LastCheckpointPlatform,NumCkpts,Requirements,NiceUser"; WantMatchDiagnostics = TRUE; LastMatchTime = 1214541600; NumJobMatches = 1; 
OrigMaxHosts = 1; JobStartDate = 1214541604; JobCurrentStartDate = 1214541604; JobRunCount = 1; TransferFiles = "NEVER"; DiskUsage_RAW = 8798; DiskUsage = 10000; LastJobLeaseRenewal = 1214541696; RemoteSysCpu = 9.000000; RemoteUserCpu = 77.000000; ImageSize_RAW = 133276; ImageSize = 140000; ExitBySignal = FALSE; ExitCode = 0; TerminationPending = TRUE; BytesSent = 4359119.000000; BytesRecvd = 8618658.000000; JobStatus = 4; EnteredCurrentStatus = 1214541697; LastSuspensionTime = 0; RemoteWallClockTime = 93.000000; LastRemoteHost = "slot2@xxxxxxxxxxxxxxxxxxxx"; LastPublicClaimId = "<147.66.12.120:52508>#1212646565#857#..."; LastPublicClaimIds = ""; CurrentHosts = 0; CompletionDate = 1214541697; JobFinishedHookDone = 1214541697; ServerTime = 1214541818; MyType = "Job"; TargetType = "Machine"; ]' '[ MyType = "Job"; TargetType = "Machine"; GlobalJobId = "ERM-43880.xxx.yyy.au#1214541518#70.0"; NTDomain = "ANTDIV"; MinHosts = 1; MaxHosts = 1; WantRemoteSyscalls = FALSE; WantCheckpoint = FALSE; JobPrio = 0; NiceUser = FALSE; WantRemoteIO = TRUE; CoreSize = 4856408; Rank = 0.000000; In = "/dev/null"; TransferIn = FALSE; Out = "Output_108.0.txt"; StreamOut = FALSE; Err = "Error_108.0.txt"; StreamErr = FALSE; BufferSize = 524288; BufferBlockSize = 32768; TransferInput = "estimation.csl,population.csl,output.csl,mpd.dat"; ExecutableSize_RAW = 10000; ExecutableSize = 10000; JobUniverse = 5; QDate = 1214542014; LocalUserCpu = 0.000000; LocalSysCpu = 0.000000; ExitStatus = 0; NumCkpts_RAW = 0; NumCkpts = 0; NumRestarts = 0; NumSystemHolds = 0; CommittedTime = 0; TotalSuspensions = 0; CumulativeSuspensionTime = 0; JobNotification = 0; LeaveJobInQueue = JobStatus == 4; User = "troy_rob@ERM-411' 'S' 'NULL'
6/27 14:59:30 [472] ERROR "Bad CONDOR_JOB_STATUS_CONSTRAINED Result" at line 3808 in file ..\src\condor_gridmanager\gahp-client.C
6/27 14:59:52 WARNING: Config source is empty: C:\condor/condor_config.local


Troy

-----Original Message-----
From: condor-users-bounces@xxxxxxxxxxx [mailto:condor-users-bounces@xxxxxxxxxxx] On Behalf Of Jaime Frey
Sent: Friday, 27 June 2008 12:20 AM
To: Condor-Users Mail List
Subject: Re: [Condor-users] "Bad CONDOR_JOB_STATUS_CONSTRAINED Result"[Sec=Unclassified]

On Jun 25, 2008, at 6:14 PM, Troy Robertson wrote:


Condor-C jobs are being submitted and executed, and show as status Completed on the remote Linux central manager, but they are not being returned to the Windows submit machines.
The gridmanager keeps logging the following error:
 
6/26 08:56:10 [1556] ERROR "Bad CONDOR_JOB_STATUS_CONSTRAINED Result" at line 3808 in file ..\src\condor_gridmanager\gahp-client.C
 
I also keep ending up with a core.C_GAHP.WIN32 core dump from the GAHP server.
..
Please, does anyone have any ideas? 
I moved our Condor config to Condor-C to address problems for users submitting from laptops, and now they are unable to receive their results.

Can you add the following line to the Condor config file on the submitting machine:
GRIDMANAGER_DEBUG = D_FULLDEBUG

Then let the error occur again and send me the last 20 lines or so of the gridmanager log.
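In case it helps anyone following along, the steps above can be run roughly like this (a sketch; the log's location depends on your local configuration):

```shell
# After adding GRIDMANAGER_DEBUG = D_FULLDEBUG to the submit machine's
# config, tell the running daemons to re-read their configuration:
condor_reconfig

# Once the error recurs, grab the tail of the gridmanager log.
# condor_config_val resolves the configured log path:
tail -n 20 "$(condor_config_val GRIDMANAGER_LOG)"
```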

Thanks and regards,
Jaime Frey
UW-Madison Condor Team




___________________________________________________________________________

    Australian Antarctic Division - Commonwealth of Australia
IMPORTANT: This transmission is intended for the addressee only. If you are not the
intended recipient, you are notified that use or dissemination of this communication is
strictly prohibited by Commonwealth law. If you have received this transmission in error,
please notify the sender immediately by e-mail or by telephoning +61 3 6232 3209 and
DELETE the message.
        Visit our web site at http://www.antarctica.gov.au/
___________________________________________________________________________