
Re: [Condor-users] Claimed Idle jobs on windows machines



-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I am seeing the following in the Starter logs as well. I have seen on
the mailing list that this could be related to permissions on the
Windows machines, but the permissions look fine.


6/24 15:00:10 condor_read(): recv() returned -1, errno = 10054, assuming failure.
6/24 15:00:10 ERROR "Assertion ERROR on (result)" at line 270 in file ..\src\condor_starter.V6.1\NTsenders.C
6/24 15:00:10 ShutdownFast all jobs.

Any help would be appreciated.
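
For reference, errno 10054 on Windows is the Winsock code WSAECONNRESET
(connection reset by peer), so the recv() failure above means the other
end of the socket went away. Which daemon that peer is cannot be told
from this line alone. Below is a minimal Python sketch for decoding such
codes out of a Starter log line. The lookup table and the helper name
describe_errno are my own illustration, not part of Condor, and the
table only covers a few common codes.

```python
import re

# A few common Winsock error codes (values as defined by Windows Sockets).
# This table is illustrative and deliberately incomplete.
WINSOCK_ERRORS = {
    10053: ("WSAECONNABORTED", "Software caused connection abort"),
    10054: ("WSAECONNRESET", "Connection reset by peer"),
    10060: ("WSAETIMEDOUT", "Connection timed out"),
    10061: ("WSAECONNREFUSED", "Connection refused"),
}

# Matches the "errno = NNNNN" fragment seen in the condor_read() log line.
ERRNO_RE = re.compile(r"errno = (\d+)")


def describe_errno(log_line):
    """Return a readable description of the errno in a log line, or None."""
    match = ERRNO_RE.search(log_line)
    if match is None:
        return None
    code = int(match.group(1))
    name, text = WINSOCK_ERRORS.get(code, ("UNKNOWN", "not in this table"))
    return f"{code} {name}: {text}"


line = ("6/24 15:00:10 condor_read(): recv() returned -1, "
        "errno = 10054, assuming failure.")
print(describe_errno(line))  # 10054 WSAECONNRESET: Connection reset by peer
```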

Joe Rinkovsky wrote:
| Hello,
|
| We have been having problems where jobs will end up in a state where
| they are in the queue marked as running but if you look at condor_status
| they are either Claimed Idle or have an owner.
|
| This behavior has been happening whether jobs complete on their own or
| someone starts using the machine. I wrote a test program that does
| nothing but sleep for 30 minutes and then output some text to the
| console. When submitted it will run for 30 minutes as it should and I
| will see the output but when it exits the machine goes to Claimed Idle
| and the job is still running in the queue. Here is the log from the
| STARTD on the machine that my test job was running on. The main thing I
| am confused about is why the starter is exiting with signal 4 and what
| is causing the machine state to not be changed back to unclaimed when
| the job is done. Our central manager is a Linux machine and the machines
| in the pool are Windows XP. They are all running condor 6.6.5. Any help
| would be appreciated. I just can't seem to figure out why this is
| happening. Any clues? Please let me know if you need additional
| information.
|
| Thanks in advance.
|
| --Joe Rinkovsky
| Unix Systems Support Group
| Indiana University
|
|
| 6/24 08:31:10 Changing state: Unclaimed -> Matched
| 6/24 08:31:10 DaemonCore: Command received via TCP from host <129.79.4.13:10599>
| 6/24 08:31:10 DaemonCore: received command 442 (REQUEST_CLAIM), calling handler (command_request_claim)
| 6/24 08:31:10 Request accepted.
| 6/24 08:31:11 Remote owner is jrinkovs@xxxxxxxxxxxxxxxxxxxxxxxxxxx
| 6/24 08:31:11 State change: claiming protocol successful
| 6/24 08:31:11 Changing state: Matched -> Claimed
| 6/24 08:31:13 DaemonCore: Command received via TCP from host <129.79.4.13:11795>
| 6/24 08:31:13 DaemonCore: received command 444 (ACTIVATE_CLAIM), calling handler (command_activate_claim)
| 6/24 08:31:13 Got activate_claim request from shadow (<129.79.4.13:11795>)
| 6/24 08:31:13 Remote job ID is 1913.0
| 6/24 08:31:13 Got universe "VANILLA" (5) from request classad
| 6/24 08:31:13 State change: claim-activation protocol successful
| 6/24 08:31:13 Changing activity: Idle -> Busy
| 6/24 08:31:44 DaemonCore: Command received via TCP from host <129.79.4.13:11095>
| 6/24 08:31:44 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling handler (command_query_ads)
| 6/24 08:31:44 In command_query_ads
| 6/24 08:31:52 DaemonCore: Command received via TCP from host <129.79.4.13:10904>
| 6/24 08:31:52 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling handler (command_query_ads)
| 6/24 08:31:52 In command_query_ads
| 6/24 08:31:56 DaemonCore: Command received via TCP from host <129.79.4.13:11307>
| 6/24 08:31:56 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling handler (command_query_ads)
| 6/24 08:31:56 In command_query_ads
| 6/24 08:32:59 DaemonCore: Command received via TCP from host <129.79.4.13:12008>
| 6/24 08:32:59 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling handler (command_query_ads)
| 6/24 08:32:59 In command_query_ads
| 6/24 08:33:05 DaemonCore: Command received via TCP from host <129.79.4.13:11125>
| 6/24 08:33:05 DaemonCore: received command 5 (QUERY_STARTD_ADS), calling handler (command_query_ads)
| 6/24 08:33:05 In command_query_ads
| 6/24 09:02:05 DaemonCore: Command received via UDP from host <129.79.19.65:1126>
| 6/24 09:02:05 DaemonCore: received command 60001 (DC_PROCESSEXIT), calling handler (HandleProcessExitCommand())
| 6/24 09:02:05 Starter pid 584 exited with status 4
| 6/24 09:02:06 State change: starter exited
| 6/24 09:02:06 Changing activity: Busy -> Idle
|
|
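
To make the sequence in the quoted STARTD log easier to follow, here is
a rough Python sketch that pulls out only the state and activity
transitions and the starter exit. The regular expressions are written
against the exact line shapes quoted above and are an assumption about
the log format in general; the function name summarize is hypothetical.
Note that the log reports "exited with status 4", i.e. an exit status,
and that the last transition leaves the machine in Claimed/Idle, which
matches what condor_status shows.

```python
import re

# Line shapes taken from the quoted log; other Condor versions may differ.
TRANSITION_RE = re.compile(
    r"^(?P<ts>\d+/\d+ \d+:\d+:\d+) Changing (?P<kind>state|activity): "
    r"(?P<old>\w+) -> (?P<new>\w+)"
)
STARTER_EXIT_RE = re.compile(
    r"^(?P<ts>\d+/\d+ \d+:\d+:\d+) Starter pid (?P<pid>\d+) "
    r"exited with status (?P<status>\d+)"
)


def summarize(lines):
    """Track the last seen state/activity and collect starter exits."""
    state, activity, exits = None, None, []
    for raw in lines:
        # Drop mail quoting characters ("|", "~") and surrounding spaces.
        line = raw.lstrip("|~ ").rstrip()
        match = TRANSITION_RE.match(line)
        if match:
            if match.group("kind") == "state":
                state = match.group("new")
            else:
                activity = match.group("new")
            continue
        match = STARTER_EXIT_RE.match(line)
        if match:
            exits.append((int(match.group("pid")), int(match.group("status"))))
    return {"state": state, "activity": activity, "starter_exits": exits}


sample = [
    "| 6/24 08:31:10 Changing state: Unclaimed -> Matched",
    "| 6/24 08:31:11 Changing state: Matched -> Claimed",
    "| 6/24 08:31:13 Changing activity: Idle -> Busy",
    "| 6/24 09:02:05 Starter pid 584 exited with status 4",
    "| 6/24 09:02:06 Changing activity: Busy -> Idle",
]
summary = summarize(sample)
print(summary)
# {'state': 'Claimed', 'activity': 'Idle', 'starter_exits': [(584, 4)]}
```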

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFA2zOttSbyb4bh6DoRAlVdAJ9zZBQLeltOApctVZ8cpFdaDThbUwCg2W8+
OI2qcE7vWuwfVZmv3sL2wgs=
=8fZS
-----END PGP SIGNATURE-----