[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[HTCondor-users] FileTransfer::UpLoadFiles called during active transfer



One of our customers reported seeing file transfer errors in some jobs
which sometimes match with missing job output and zero-byte stdout and
stderr files (note that the error message is neither necessary nor
sufficient). These are Java universe jobs with a 7.8.7 schedd and
7.8.8 execute node.

>From the HTCondor code, I can tell that file transfer appears to be
limited to single file at a time. What's unclear is what happens when
this occurs (i.e. does it retry the file transfer? It seems like the
answer is 'no') and if this is something that can be prevented. Job,
starter, and shadow logs are below. Other logs have since rotated and
are unavailable.


Thanks,
BC

### Job log

000 (258067.9454.000) 01/21 16:12:21 Job submitted from host:
<10.219.144.129:46655>

    DAG Node: Main-0

...

007 (258067.9454.000) 01/22 10:37:03 Shadow exception!

        Error from slot8@xxxxxxxxxxxxxxxxxxxxxx:
FileTransfer::UpLoadFiles called during active transfer!



        0  -  Run Bytes Sent By Job

        58729088  -  Run Bytes Received By Job

...

001 (258067.9454.000) 01/22 10:38:32 Job executing on host: <10.129.85.231:2651>

...

006 (258067.9454.000) 01/22 10:38:42 Image size of job updated: 275

        1  -  MemoryUsage of job (MB)

        100  -  ResidentSetSize of job (KB)

...

006 (258067.9454.000) 01/22 10:43:42 Image size of job updated: 401920

        393  -  MemoryUsage of job (MB)

        401920  -  ResidentSetSize of job (KB)

...

006 (258067.9454.000) 01/22 10:48:41 Image size of job updated: 401928

        393  -  MemoryUsage of job (MB)

        401920  -  ResidentSetSize of job (KB)

...

005 (258067.9454.000) 01/22 10:48:41 Job terminated.

        (1) Normal termination (return value 0)

                Usr 0 00:06:49, Sys 0 00:01:12  -  Run Remote Usage

                Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage

                Usr 0 00:06:49, Sys 0 00:01:12  -  Total Remote Usage

                Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage

        760696  -  Run Bytes Sent By Job

        58729088  -  Run Bytes Received By Job

        760696  -  Total Bytes Sent By Job

        58729088  -  Total Bytes Received By Job

        Partitionable Resources :    Usage  Request

           Cpus                 :                 1

           Disk (KB)            :   111128   111128

           Memory (MB)          :      393      393


## StarterLog

01/22/14 10:34:55 Using config source: C:\Condor\v7.8\condor_config
01/22/14 10:34:55 Using local config sources:
01/22/14 10:34:55    C:\Condor\v7.8/condor_config.local
01/22/14 10:34:55    C:\Condor\v7.8/bin\cycle_cache_config.exe
C:\Condor\local\cache_config 300 30 http://cycleserver/condor/ass
igned_template/execute682.example.com |
01/22/14 10:34:55    C:\Condor\v7.8/bin\cycle_cache_config.exe
C:\Condor\Local\cache_config 300 30 http://cycleserver/condor/ass
igned_template/execute682.example.com |
01/22/14 10:34:55 DaemonCore: command socket at <10.129.83.41:2935>
01/22/14 10:34:55 DaemonCore: private command socket at <10.129.83.41:2935>
01/22/14 10:34:55 Setting maximum accepts per cycle 8.
01/22/14 10:34:55 GLEXEC_JOB not supported on this platform; ignoring
01/22/14 10:34:55 Communicating with shadow <10.219.144.129:43076?noUDP>
01/22/14 10:34:55 Submitting machine is "server018.example.com"
01/22/14 10:34:55 setting the orig job name in starter
01/22/14 10:34:55 setting the orig job iwd in starter
01/22/14 10:34:55 Initialized IO Proxy.
01/22/14 10:34:55 Setting resource limits not implemented!
01/22/14 10:37:03 Received UDP command 60000 (DC_RAISESIGNAL) from
condor@parent <10.129.83.41:3021>, access level DAEMON
01/22/14 10:37:03 Calling HandleReq <HandleSigCommand()> (0) for
command 60000 (DC_RAISESIGNAL) from condor@parent <10.129.83.41:3021>
01/22/14 10:37:03 Return from HandleReq <HandleSigCommand()> (handler:
0.000s, sec: 0.000s, payload: 0.000s)
01/22/14 10:37:03 Got SIGTERM. Performing graceful shutdown.
01/22/14 10:37:03 ShutdownGraceful all jobs.
01/22/14 10:37:03 ERROR "FileTransfer::UpLoadFiles called during
active transfer!
" at line 1134 in file
c:\condor\execute\dir_11160\userdir\src\condor_utils\file_transfer.cpp
01/22/14 10:37:03 ShutdownFast all jobs.
01/22/14 10:37:03 condor_read() failed: recv(fd=1532) returned -1,
errno = 10053 , reading 5 bytes from <10.219.144.129:31572>.
01/22/14 10:37:03 IO: Failed to read packet header
01/22/14 10:37:05 FileTransfer object destructor called during active
transfer.  Cancelling transfer.
01/22/14 10:37:53 Locale: English_United States.1252
01/22/14 10:37:53 Setting maximum accepts per cycle 8.


## Shadow log

01/22/14 10:34:55 Using config source: /opt/condor/etc/condor_config
01/22/14 10:34:55 Using local config sources:
01/22/14 10:34:55    /opt/condor/local.server018/condor_config.local
01/22/14 10:34:55 DaemonCore: command socket at <10.219.144.129:43076?noUDP>
01/22/14 10:34:55 DaemonCore: private command socket at <10.219.144.129:43076>
01/22/14 10:34:55 Setting maximum accepts per cycle 8.
01/22/14 10:34:55 Initializing a JAVA shadow for job 258067.9454
01/22/14 10:34:55 (258067.9454) (3402472): Request to run on
slot8@xxxxxxxxxxxxxxxxxxxxxx <10.129.83.41:3429> was ACCEPTED
01/22/14 10:34:55 (258067.9454) (3402472): Calling Handler <HandleSyscalls> (1)
01/22/14 10:34:55 (258067.9454) (3402472): Return from Handler
<HandleSyscalls> 0.0016s
01/22/14 10:34:55 (258067.9454) (3402472): Calling Handler <HandleSyscalls> (1)
01/22/14 10:34:55 (258067.9454) (3402472): Return from Handler
<HandleSyscalls> 0.0004s
01/22/14 10:34:55 (258067.9454) (3402472): Calling Handler
<DaemonCommandProtocol::WaitForSocketData> (2)
01/22/14 10:34:55 (258067.9454) (3402472): Received TCP command 61000
(FILETRANS_UPLOAD) from unauthenticated@unmapped <10.129.83.41:2939>,
access level WRITE
01/22/14 10:34:55 (258067.9454) (3402472): Calling HandleReq
<FileTransfer::HandleCommands()> (0) for command 61000
(FILETRANS_UPLOAD) from unauthenticated@unmapped <10.129.83.41:2939>
01/22/14 10:37:03 (258067.9454) (3402472): Return from HandleReq
<FileTransfer::HandleCommands()> (handler: 127.945s, sec: 0.000s,
payload: 0.000s)
01/22/14 10:37:03 (258067.9454) (3402472): Return from Handler
<DaemonCommandProtocol::WaitForSocketData> 127.9453s
01/22/14 10:37:03 (258067.9454) (3402472): Calling Handler <HandleSyscalls> (1)
01/22/14 10:37:03 (258067.9454) (3402472): Return from Handler
<HandleSyscalls> 0.0005s
01/22/14 10:37:03 (258067.9454) (3402472): Calling Handler <HandleSyscalls> (1)
01/22/14 10:37:03 (258067.9454) (3402472): ERROR "Error from
slot8@xxxxxxxxxxxxxxxxxxxxxx: FileTransfer::UpLoadFiles called during
active transfer!
" at line 726 in file
/slots/01/dir_15970/userdir/src/condor_shadow.V6.1/pseudo_ops.cpp



-- 
Ben Cotton
main: 888.292.5320

Cycle Computing
Leader in Utility HPC Software

http://www.cyclecomputing.com
twitter: @cyclecomputing