Mailing List Archives
Public Access
|
|
|
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[HTCondor-users] FileTransfer::UpLoadFiles called during active transfer
- Date: Mon, 27 Jan 2014 12:21:52 -0500
- From: Ben Cotton <ben.cotton@xxxxxxxxxxxxxxxxxx>
- Subject: [HTCondor-users] FileTransfer::UpLoadFiles called during active transfer
One of our customers reported seeing file transfer errors in some jobs
which sometimes match with missing job output and zero-byte stdout and
stderr files (note that the error message is neither necessary nor
sufficient). These are Java universe jobs with a 7.8.7 schedd and
7.8.8 execute node.
>From the HTCondor code, I can tell that file transfer appears to be
limited to single file at a time. What's unclear is what happens when
this occurs (i.e. does it retry the file transfer? It seems like the
answer is 'no') and if this is something that can be prevented. Job,
starter, and shadow logs are below. Other logs have since rotated and
are unavailable.
Thanks,
BC
### Job log
000 (258067.9454.000) 01/21 16:12:21 Job submitted from host:
<10.219.144.129:46655>
DAG Node: Main-0
...
007 (258067.9454.000) 01/22 10:37:03 Shadow exception!
Error from slot8@xxxxxxxxxxxxxxxxxxxxxx:
FileTransfer::UpLoadFiles called during active transfer!
0 - Run Bytes Sent By Job
58729088 - Run Bytes Received By Job
...
001 (258067.9454.000) 01/22 10:38:32 Job executing on host: <10.129.85.231:2651>
...
006 (258067.9454.000) 01/22 10:38:42 Image size of job updated: 275
1 - MemoryUsage of job (MB)
100 - ResidentSetSize of job (KB)
...
006 (258067.9454.000) 01/22 10:43:42 Image size of job updated: 401920
393 - MemoryUsage of job (MB)
401920 - ResidentSetSize of job (KB)
...
006 (258067.9454.000) 01/22 10:48:41 Image size of job updated: 401928
393 - MemoryUsage of job (MB)
401920 - ResidentSetSize of job (KB)
...
005 (258067.9454.000) 01/22 10:48:41 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:06:49, Sys 0 00:01:12 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:06:49, Sys 0 00:01:12 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
760696 - Run Bytes Sent By Job
58729088 - Run Bytes Received By Job
760696 - Total Bytes Sent By Job
58729088 - Total Bytes Received By Job
Partitionable Resources : Usage Request
Cpus : 1
Disk (KB) : 111128 111128
Memory (MB) : 393 393
## StarterLog
01/22/14 10:34:55 Using config source: C:\Condor\v7.8\condor_config
01/22/14 10:34:55 Using local config sources:
01/22/14 10:34:55 C:\Condor\v7.8/condor_config.local
01/22/14 10:34:55 C:\Condor\v7.8/bin\cycle_cache_config.exe
C:\Condor\local\cache_config 300 30 http://cycleserver/condor/ass
igned_template/execute682.example.com |
01/22/14 10:34:55 C:\Condor\v7.8/bin\cycle_cache_config.exe
C:\Condor\Local\cache_config 300 30 http://cycleserver/condor/ass
igned_template/execute682.example.com |
01/22/14 10:34:55 DaemonCore: command socket at <10.129.83.41:2935>
01/22/14 10:34:55 DaemonCore: private command socket at <10.129.83.41:2935>
01/22/14 10:34:55 Setting maximum accepts per cycle 8.
01/22/14 10:34:55 GLEXEC_JOB not supported on this platform; ignoring
01/22/14 10:34:55 Communicating with shadow <10.219.144.129:43076?noUDP>
01/22/14 10:34:55 Submitting machine is "server018.example.com"
01/22/14 10:34:55 setting the orig job name in starter
01/22/14 10:34:55 setting the orig job iwd in starter
01/22/14 10:34:55 Initialized IO Proxy.
01/22/14 10:34:55 Setting resource limits not implemented!
01/22/14 10:37:03 Received UDP command 60000 (DC_RAISESIGNAL) from
condor@parent <10.129.83.41:3021>, access level DAEMON
01/22/14 10:37:03 Calling HandleReq <HandleSigCommand()> (0) for
command 60000 (DC_RAISESIGNAL) from condor@parent <10.129.83.41:3021>
01/22/14 10:37:03 Return from HandleReq <HandleSigCommand()> (handler:
0.000s, sec: 0.000s, payload: 0.000s)
01/22/14 10:37:03 Got SIGTERM. Performing graceful shutdown.
01/22/14 10:37:03 ShutdownGraceful all jobs.
01/22/14 10:37:03 ERROR "FileTransfer::UpLoadFiles called during
active transfer!
" at line 1134 in file
c:\condor\execute\dir_11160\userdir\src\condor_utils\file_transfer.cpp
01/22/14 10:37:03 ShutdownFast all jobs.
01/22/14 10:37:03 condor_read() failed: recv(fd=1532) returned -1,
errno = 10053 , reading 5 bytes from <10.219.144.129:31572>.
01/22/14 10:37:03 IO: Failed to read packet header
01/22/14 10:37:05 FileTransfer object destructor called during active
transfer. Cancelling transfer.
01/22/14 10:37:53 Locale: English_United States.1252
01/22/14 10:37:53 Setting maximum accepts per cycle 8.
## Shadow log
01/22/14 10:34:55 Using config source: /opt/condor/etc/condor_config
01/22/14 10:34:55 Using local config sources:
01/22/14 10:34:55 /opt/condor/local.server018/condor_config.local
01/22/14 10:34:55 DaemonCore: command socket at <10.219.144.129:43076?noUDP>
01/22/14 10:34:55 DaemonCore: private command socket at <10.219.144.129:43076>
01/22/14 10:34:55 Setting maximum accepts per cycle 8.
01/22/14 10:34:55 Initializing a JAVA shadow for job 258067.9454
01/22/14 10:34:55 (258067.9454) (3402472): Request to run on
slot8@xxxxxxxxxxxxxxxxxxxxxx <10.129.83.41:3429> was ACCEPTED
01/22/14 10:34:55 (258067.9454) (3402472): Calling Handler <HandleSyscalls> (1)
01/22/14 10:34:55 (258067.9454) (3402472): Return from Handler
<HandleSyscalls> 0.0016s
01/22/14 10:34:55 (258067.9454) (3402472): Calling Handler <HandleSyscalls> (1)
01/22/14 10:34:55 (258067.9454) (3402472): Return from Handler
<HandleSyscalls> 0.0004s
01/22/14 10:34:55 (258067.9454) (3402472): Calling Handler
<DaemonCommandProtocol::WaitForSocketData> (2)
01/22/14 10:34:55 (258067.9454) (3402472): Received TCP command 61000
(FILETRANS_UPLOAD) from unauthenticated@unmapped <10.129.83.41:2939>,
access level WRITE
01/22/14 10:34:55 (258067.9454) (3402472): Calling HandleReq
<FileTransfer::HandleCommands()> (0) for command 61000
(FILETRANS_UPLOAD) from unauthenticated@unmapped <10.129.83.41:2939>
01/22/14 10:37:03 (258067.9454) (3402472): Return from HandleReq
<FileTransfer::HandleCommands()> (handler: 127.945s, sec: 0.000s,
payload: 0.000s)
01/22/14 10:37:03 (258067.9454) (3402472): Return from Handler
<DaemonCommandProtocol::WaitForSocketData> 127.9453s
01/22/14 10:37:03 (258067.9454) (3402472): Calling Handler <HandleSyscalls> (1)
01/22/14 10:37:03 (258067.9454) (3402472): Return from Handler
<HandleSyscalls> 0.0005s
01/22/14 10:37:03 (258067.9454) (3402472): Calling Handler <HandleSyscalls> (1)
01/22/14 10:37:03 (258067.9454) (3402472): ERROR "Error from
slot8@xxxxxxxxxxxxxxxxxxxxxx: FileTransfer::UpLoadFiles called during
active transfer!
" at line 726 in file
/slots/01/dir_15970/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp
--
Ben Cotton
main: 888.292.5320
Cycle Computing
Leader in Utility HPC Software
http://www.cyclecomputing.com
twitter: @cyclecomputing