
Re: [Condor-users] job stuck in idle mode - HasFileTransfer



Garrett,
 
Thanks for your help once again. I changed FILESYSTEM_DOMAIN on the execute node as you suggested and set should_transfer_files to YES in the job submit file.
 
'condor_status -long | grep -i transfer' does not return any result.
condor_status -long gives long output with detailed information about the execute node. Is there any particular attribute I should look at?
 
Here is the output -
 
Machine = "condor-slave.local"
EnteredCurrentState = 1313804328
MonitorSelfAge = 241
IsValidCheckpointPlatform = ( ( ( TARGET.JobUniverse == 1 ) == false ) || ( ( MY.CheckpointPlatform =!= undefined ) && ( ( TARGET.LastCheckpointPlatform =?= MY.CheckpointPlatform ) || ( TARGET.NumCkpts == 0 ) ) ) )
CpuIsBusy = false
LastBenchmark = 1313804353
HasVM = false
HibernationSupportedStates = "S4"
Name = "condor-slave.local"
StarterAbilityList = "HasRemoteSyscalls,HasCheckpointing"
AuthenticatedIdentity = "unauthenticated@unmapped"
NumPids = 0
TimeToLive = 2147483647
MonitorSelfSecuritySessions = 3
TotalDisk = 2788376
CondorVersion = "$CondorVersion: 7.6.2 Jul 15 2011 BuildID: 351672 $"
Unhibernate = MY.MachineLastMatchTime =!= undefined
LastUpdate = 1313804353
IsWakeOnLanSupported = false
Cpus = 1
HasCheckpointing = true
ClockDay = 5
HibernationLevel = 0
HibernationState = "NONE"
MonitorSelfRegisteredSocketCount = 1
StartdIpAddr = "<10.8.0.10:47725>"
TotalVirtualMemory = 265208
TotalLoadAvg = 0.0
HasIOProxy = true
HardwareAddress = "00:00:00:00:00:00"
UpdateSequenceNumber = 2
VirtualMemory = 265208
TotalMemory = 502
TotalTimeUnclaimedIdle = 279
MyAddress = "<10.8.0.10:47725>"
LastFetchWorkSpawned = 0
MyCurrentTime = 1313804632
COLLECTOR_HOST_STRING = "10.8.0.1"
FileSystemDomain = "condor-slave.local"
CpuBusyTime = 0
CpuBusy = ( ( LoadAvg - CondorLoadAvg ) >= 0.500000 )
HasRemoteSyscalls = true
MonitorSelfImageSize = 9048.000000
Memory = 502
IsWakeAble = false
CanHibernate = true
TotalTimeUnclaimedBenchmarking = 25
TotalCpus = 1
WakeOnLanSupportedFlags = "NONE"
ClockMin = 1123
UpdatesHistory = "0x00000000000000000000000000000000"
CurrentRank = 0.0
OpSys = "LINUX"
State = "Unclaimed"
KFlops = 1179712
Start = true
MonitorSelfCPUUsage = 0.008319
MaxJobRetirementTime = 0
Arch = "INTEL"
Mips = 13129
Activity = "Idle"
MonitorSelfTime = 1313804568
ConsoleIdle = 0
SubnetMask = "255.255.255.255"
UpdatesLost = 0
KeyboardIdle = 0
UpdatesSequenced = 2
TargetType = "Job"
CheckpointPlatform = "LINUX INTEL 2.6.x normal 0x4001d000"
Rank = 0.0
WakeOnLanEnabledFlags = "NONE"
UpdatesTotal = 3
CondorPlatform = "$CondorPlatform: x86_deb_5.0 $"
LoadAvg = 0.0
TotalCondorLoadAvg = 0.0
CurrentTime = time()
Disk = 2788376
CondorLoadAvg = 0.0
IsWakeOnLanEnabled = false
DaemonStartTime = 1313804328
TotalSlots = 1
UidDomain = "condor-slave.local"
EnteredCurrentActivity = 1313804353
SlotWeight = Cpus
SlotID = 1
LastFetchWorkCompleted = 0
NextFetchWorkDelay = -1
Requirements = ( START ) && ( IsValidCheckpointPlatform )
MyType = "Machine"
LastHeardFrom = 1313804665
MonitorSelfResidentSetSize = 4232
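
A minimal sketch, in Python, of how a listing like the one above could be checked for file-transfer support. It assumes the 'Name = value' layout shown here; the helper names parse_machine_ad and file_transfer_report are hypothetical, and the sample text is a short excerpt of the listing, not the complete ad.

```python
def parse_machine_ad(text):
    """Parse 'Name = value' lines from `condor_status -long` output into a dict."""
    ad = {}
    for line in text.splitlines():
        # The first ' = ' is the assignment; later '==' operators are left in the value.
        name, sep, value = line.partition(" = ")
        if sep:  # skip blank or malformed lines
            ad[name.strip()] = value.strip()
    return ad


def file_transfer_report(ad):
    """Summarize the attributes relevant to the HasFileTransfer requirement."""
    abilities = ad.get("StarterAbilityList", '""').strip('"')
    return {
        "has_file_transfer_attr": "HasFileTransfer" in ad,
        "in_starter_ability_list": "HasFileTransfer" in abilities.split(","),
        "filesystem_domain": ad.get("FileSystemDomain", "").strip('"'),
    }


# Excerpt copied from the listing above.
SAMPLE = '''\
Machine = "condor-slave.local"
StarterAbilityList = "HasRemoteSyscalls,HasCheckpointing"
HasCheckpointing = true
FileSystemDomain = "condor-slave.local"
HasRemoteSyscalls = true
'''

report = file_transfer_report(parse_machine_ad(SAMPLE))
print(report)
```

For this ad the HasFileTransfer attribute is absent, and StarterAbilityList names only HasRemoteSyscalls and HasCheckpointing, which is consistent with the grep for "transfer" returning nothing.
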
Appreciate your help.
 
Shiv


 
On Fri, Aug 19, 2011 at 6:02 PM, Koller, Garrett <kollerg14@xxxxxxxxxxxx> wrote:
Mr. Agarwal,

First of all, based on your situation, FILESYSTEM_DOMAIN should be set to $(FULL_HOSTNAME) (not "10.8.0.1, condor-mstr") since they don't share a filesystem.  In your submit file, "should_transfer_files" should always be set to "YES" for the same reason.  After you change the configuration file, restart both computers to make sure Condor has a fresh start with the new configuration settings.
That is odd.  HasFileTransfer should be defined, even if it's false for some reason.  What version of Condor are you running?  ('condor_version' will report it.)  Also, recheck the StartLog for unusual errors or warnings.
Do you get anything when you run 'condor_status -long | grep -i transfer'?  If not, what is the complete output of 'condor_status -long'?

Best Regards,
 - Garrett
condor.cs.wlu.edu

Sent: Friday, August 19, 2011 7:06 PM
To: Condor-Users Mail List
Subject: Re: [Condor-users] job stuck in idle mode - HasFileTransfer

Garrett,

Appreciate your quick reply. I tried the commands you mentioned.

condor_status -long | grep ^HasFileTransfer  - did not show any results

condor_status -long | grep ^FileSystemDomain - showed "10.8.0.1, condor-mstr" 

10.8.0.1 is the IP address of my master node and "condor-mstr" is its hostname.

On my execute node, FILESYSTEM_DOMAIN = 10.8.0.1, condor-mstr. I set it to both because when I run condor_config_val -v FILESYSTEM_DOMAIN on my master node it shows "condor-mstr", but on my execute node the same command shows the IP address, "10.8.0.1".

I do not have NFS set up, so I do need to transfer the files.

I don't even see any errors anywhere and what is driving me crazy is that the master does not even seem to try to transfer files. It just presumes that the execute node does not allow it as if something was preset when the execute node first connected to the master node.

This is my submit file 

Universe   = vanilla
Requirements  = Arch == "INTEL" &&  Memory >= 32
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
Executable = simple
Arguments  = 4 10
Log        = outsimple.log
Output     = outsimple.$(Process).out
Error      = outsimple.error
Queue
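
A small Python sketch of reading a submit description like the one above and asking whether the job would depend on a shared filesystem. It assumes simple 'command = value' lines only; the function names are hypothetical, the two domain strings are passed in by the caller, and the handling of an unset should_transfer_files is a simplification made for this sketch.

```python
def parse_submit(text):
    """Collect 'command = value' lines; command names are case-insensitive."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # 'Queue' and blank lines carry no '='
            settings[key.strip().lower()] = value.strip()
    return settings


def needs_shared_filesystem(settings, submit_domain, execute_domain):
    """Return True if the job would rely on a shared filesystem on this machine."""
    mode = settings.get("should_transfer_files", "").upper()
    if mode == "YES":
        return False  # files are always transferred
    if mode == "IF_NEEDED":
        # Transfer is skipped only when both sides report the same domain.
        return submit_domain.lower() == execute_domain.lower()
    return True  # "NO" or unset in this sketch: no transfer requested


# Excerpt of the submit file above.
SUBMIT = '''\
Universe   = vanilla
Requirements  = Arch == "INTEL" &&  Memory >= 32
should_transfer_files = IF_NEEDED
when_to_transfer_output = ON_EXIT
Executable = simple
Queue
'''

settings = parse_submit(SUBMIT)
print(settings["should_transfer_files"])
print(needs_shared_filesystem(settings, "condor-mstr", "condor-slave.local"))
```

With IF_NEEDED and two different domain strings, the sketch reports that file transfer would be used, so the execute node must advertise HasFileTransfer for the job to match.
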


Shiv

On Fri, Aug 19, 2011 at 3:54 PM, Koller, Garrett <kollerg14@xxxxxxxxxxxx> wrote:
Mr. Agarwal,

I don't think TRUST_UID_DOMAIN is the problem.  Run 'condor_status -long | grep ^HasFileTransfer' and 'condor_status -long | grep ^FileSystemDomain' to find out which of the two conditions is failing.  I'm assuming these two conditions were automatically inserted into your job's requirements because you enabled file transfer in the submit file or Condor needs it by default.  Assuming that file transfer can work on all of your machines, HasFileTransfer should be true for all of them, and FileSystemDomain should be set to the domain that all of the machines belong to (such as "cs.wisc.edu"), depending on your situation.  Check the FILESYSTEM_DOMAIN variable in the configuration files.  If your machines all share a filesystem (using NFS or mounted home directories or something), they should all be set to the same internet subdomain that they all belong to.
I know this is basic stuff, but hopefully this will prompt you to check your configuration to see if anything is wrong.  Besides that, I don't know exactly what causes Condor to set HasFileTransfer to true or false.  Search the documentation for descriptions of these variables for more information.

Best Regards,
 - Garrett Heath Koller
kollerg14@xxxxxxxxxxxx

From: condor-users-bounces@xxxxxxxxxxx [condor-users-bounces@xxxxxxxxxxx] on behalf of Shiv Agarwal [shiv@xxxxxxxxxxx]
Sent: Friday, August 19, 2011 6:16 PM
To: condor-users
Subject: [Condor-users] job stuck in idle mode - HasFileTransfer

I have set up a small Condor pool with 1 master node and 1 execute node.

I see no error messages in the master or worker node log files whatsoever. In fact, the worker node does not even receive the request to execute the job. From my understanding, the master node itself decides not to send the job to the execute node.

condor_q -analyze shows me that this particular requirement did not match:

 ( ( TARGET.HasFileTransfer ) || ( TARGET.FileSystemDomain == "condor-mstr" ) )  0
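
One way to read the line above, sketched in Python; this is an illustration, not Condor's matchmaking code. It assumes ClassAd-style three-valued logic in which an attribute missing from the machine ad is undefined (modelled here as None) and a requirement matches only when it evaluates to true. The function names are hypothetical.

```python
def classad_or(a, b):
    """Three-valued OR: True wins, otherwise undefined (None) propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def transfer_requirement(machine_ad, submit_domain):
    """Evaluate ( HasFileTransfer || FileSystemDomain == submit_domain )."""
    has_transfer = machine_ad.get("HasFileTransfer")  # None if not advertised
    domain = machine_ad.get("FileSystemDomain")
    # ClassAd '==' compares strings without regard to case.
    same_domain = None if domain is None else domain.lower() == submit_domain.lower()
    return classad_or(has_transfer, same_domain)


def matches(machine_ad, submit_domain):
    """A requirement matches only when it evaluates to exactly True."""
    return transfer_requirement(machine_ad, submit_domain) is True


# Machine ad without HasFileTransfer and with a different FileSystemDomain
# (the domain string is the one reported later in this thread).
no_transfer = {"FileSystemDomain": "10.8.0.1, condor-mstr"}
print(matches(no_transfer, "condor-mstr"))

# Same machine, but advertising HasFileTransfer.
with_transfer = {"HasFileTransfer": True, "FileSystemDomain": "10.8.0.1, condor-mstr"}
print(matches(with_transfer, "condor-mstr"))
```

Under these assumptions, a machine that neither advertises HasFileTransfer nor reports the same FileSystemDomain as the submit side cannot satisfy the expression, which fits the 0 matches reported by the analysis.
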


I have even set the TRUST_UID_DOMAIN = True


Please HELP!


--
Shiv Agarwal

_______________________________________________
Condor-users mailing list
To unsubscribe, send a message to condor-users-request@xxxxxxxxxxx with a
subject: Unsubscribe
You can also unsubscribe by visiting
https://lists.cs.wisc.edu/mailman/listinfo/condor-users

The archives can be found at:
https://lists.cs.wisc.edu/archive/condor-users/



