"handleCmdIO" error on large dataset

General discussion of the Co:Z Toolkit
Post Reply
ek1975
Posts: 23
Joined: Tue Sep 30, 2008 10:04 pm

"handleCmdIO" error on large dataset

Post by ek1975 »

Hello, Like dovetail showed me in the other example http://dovetail.com/forum/viewtopic.php?t=919, I've got my Co:z to call a Java program using an INPUT DD dataset

Everything worked fine until I decided to stress test the application today. I passed a dataset containing 23,679 records. After processing 10,110 records, the application abended on the following error.

fromdsn-client[E]: handleCmdIO: error: EOF after receiving partial packet from CoZServer
fromdsn-client[E]: Error: no exit code received from CoZServer

I tried googling, but no luck. If any of you could throw a hint or two as to how to resolve this error, I would really appreciate it. Thanks!
dovetail
Site Admin
Posts: 2022
Joined: Thu Jul 29, 2004 12:12 pm

Post by dovetail »

23K records is not really a very large dataset and many customers pipe datasets with millions of records with good sucess. It is certainly possible to receive an error like this if the ssh connection between z/OS and the remote system is lost, but it could also happen if there were an I/O error encountered by the z/OS fromdsn program while reading the dataset.

The error messages that you are providing are those that are written to the remote shell's stderr (which is redirected by default to the STDERR DD in the launching job), and these messages only indicate that the connection to the fromdsn program on the z/OS server was interrupted. Were there any other messages in the z/OS job log to indicate the root cause?

If this problem can be reproduced, then there may be another cause, including even a bug in the code(!). In this case, let us know that you are not receiving any other messages but can reproduce it - we can provide you with some diagnostic trace settings to gather more information.
ek1975
Posts: 23
Joined: Tue Sep 30, 2008 10:04 pm

Post by ek1975 »

Thanks Dovetail, I am posting the JCL and the JES LOG output here. The one common thing I'm finding is the CPU time for the job is always 0.09 or 0.1 (rarely) when the job abends, My sysadmn is looking the config files on either side. BTW, Is there a way to attach documents while posting on this forum? Thanks.

JCL

Code: Select all

//AD@XXXCZ  JOB (HGYUDEVL),'PTERNI',MSGLEVEL=(1,1),       
//          MSGCLASS=H,CLASS=D,NOTIFY=&SYSUID             
//PROCLIB JCLLIB ORDER='SYS2.COZ.SAMPJCL'                 
//*                                                       
//CRTEPDF EXEC PROC=COZPROC,ARGS='test01@135.458.698.542'  
//INPUT  DD DISP=SHR,DSN=FILE01.MAY17.AVBG05B.BIG
//STDOUT DD SYSOUT=*                                                 
//COZCFG DD *                                                        
ssh-options=-F /u/AD@XXX/config                                      
//STDIN DD *                                                         
export CLASSPATH=$CLASSPATH:/opt/java/jar/:                          
fromdsn //DD:INPUT | java -jar /opt/java/jar/Dpj01x.jar              
//*                                                                  
//FTPFILES EXEC PROC=COZPROC,ARGS='test01@135.458.698.542',COND=(0,NE)
//STDOUT DD SYSOUT=*                                                 
//COZCFG DD *                                                        
ssh-options=-F /u/AD@XXX/config                                      
//STDIN DD *                                                         
/opt/scripts/ftptest.sh
//*

JES LOG:

Code: Select all

                    J E S 2  J O B  L O G  --  S Y S T E M  D E V L  --  N O D E  I S O S S Y S
 
09.37.42 JOB14174 ---- TUESDAY,   18 MAY 2010 ----
09.37.42 JOB14174  IRR010I  USERID AD@XXX   IS ASSIGNED TO THIS JOB.
09.37.43 JOB14174  ICH70001I AD@XXX   LAST ACCESS AT 09:34:17 ON TUESDAY, MAY 18, 2010
09.37.43 JOB14174  $HASP373 AD@XXXCZ STARTED - WLM INIT  - SRVCLASS BATNORML - SYS DEVL
09.37.43 JOB14174  IEF403I AD@XXXCZ - STARTED - TIME=09.37.43
09.49.21 JOB14174  BPXP023I THREAD 0DD2C00000000000, IN PROCESS 16777655, WAS  867
   867             TERMINATED BY SIGNAL SIGKILL, SENT FROM THREAD
   867             0DD2800000000000, IN PROCESS 67109726, UID 203031245, IN JOB AD@XXXCZ.
09.49.34 JOB14174  -                                         --TIMINGS (MINS.)--            ----PAGING COUNTS---
09.49.34 JOB14174  -JOBNAME  STEPNAME PROCSTEP    RC   EXCP    CPU    SRB  CLOCK   SERV  PG   PAGE   SWAP    VIO SWAPS STEPNO
09.49.34 JOB14174  -AD@XXXCZ CRTEPDF  COZLNCH     01  18457    .09    .00  11.85  56137   0      0      0      0     0     1
09.49.34 JOB14174  -AD@XXXCZ FTPFILES COZLNCH  FLUSH      0    .00    .00    .00      0   0      0      0      0     0     2
09.49.34 JOB14174  IEF404I AD@XXXCZ - ENDED - TIME=09.49.34
09.49.34 JOB14174  -AD@XXXCZ ENDED.  NAME-PTERNI               TOTAL CPU TIME=   .09  TOTAL ELAPSED TIME= 11.85
09.49.34 JOB14174  $HASP395 AD@XXXCZ ENDED
------ JES2 JOB STATISTICS ------
  18 MAY 2010 JOB EXECUTION DATE
           44 CARDS READ
          193 SYSOUT PRINT RECORDS
            0 SYSOUT PUNCH RECORDS
           12 SYSOUT SPOOL KBYTES
        11.85 MINUTES EXECUTION TIME
        1 //AD@XXXCZ  JOB (HGYUDEVL),'PTERNI',MSGLEVEL=(1,1),                     JOB14174
          //          MSGCLASS=H,CLASS=D,NOTIFY=&SYSUID
          IEFC653I SUBSTITUTION JCL - (HGYUDEVL),'PTERNI',MSGLEVEL=(1,1),MSGCLASS=H,CLASS=D,NOTIFY=AD@XXX
        2 //PROCLIB JCLLIB ORDER='SYS2.COZ.SAMPJCL'
          //*
          //*********************************************************************
          //*
        3 //CRTEPDF EXEC PROC=COZPROC,ARGS='test01@135.458.698.542'
          XX*********************************************************************
          XX*
          XX* Stored procedure for executing the COZ Launcher
          XX*
          XX* Tailor the proc your installation:
          XX* 1.) Tailor LIBRARY with the PDSE that contains the
          XX*     COZ load module.
          XX* 2.) Tailor COZCFGD= to point to SAMPJCL config member
          XX*********************************************************************
        4 XXEXCOZ    PROC ARGS=,                       < [-L<log_opt>]
          XX   LIBRARY='SYS2.COZ.LOADLIB',          < STEPLIB FOR COZLNCH
          XX   COZCFGD='SYS2.COZ.SAMPJCL(COZCFGD)', < Installation defaults
          XX   REGSIZE='32M',                          < Execution region size
          XX   LEPARM=''
        5 XXCOZLNCH  EXEC PGM=COZLNCH,REGION=&REGSIZE,
          XX   PARM='&LEPARM/&ARGS'
          IEFC653I SUBSTITUTION JCL - PGM=COZLNCH,REGION=32M,PARM='/test01@135.458.698.542'
        6 XXSTEPLIB  DD DSN=&LIBRARY,DISP=SHR
          IEFC653I SUBSTITUTION JCL - DSN=SYS2.COZ.LOADLIB,DISP=SHR
        7 XXCOZCONF  DD DSN=&COZCFGD,DISP=SHR
          IEFC653I SUBSTITUTION JCL - DSN=SYS2.COZ.SAMPJCL(COZCFGD),DISP=SHR
        8 XX         DD DDNAME=COZCFG
        9 //COZCFG DD *
          X/COZCFG   DD DUMMY                     < override for addtnl config
       10 XXSYSOUT   DD SYSOUT=*                  < System stderr
       11 XXSYSPRINT DD SYSOUT=*,                 < Agent stdout
          XX            DCB=(RECFM=VB,LRECL=255,BLKSIZE=10000)
       12 //STDIN DD *
          X/STDIN    DD DUMMY                     < Target app stdin
       13 //STDOUT DD SYSOUT=*
          X/STDOUT   DD SYSOUT=*,                 < Target app stdout
          X/            DCB=(RECFM=VB,LRECL=255,BLKSIZE=10000)
       14 XXSTDERR   DD SYSOUT=*,                 < Target app stderr
          XX            DCB=(RECFM=VB,LRECL=255,BLKSIZE=10000)
          XX*
       15 //INPUT  DD DISP=SHR,DSN=FILE01.MAY17.AVBG05B.BIG
       16 XX PEND
          //*
          //*
       17 //FTPFILES EXEC PROC=COZPROC,ARGS='test01@135.458.698.542',COND=(0,NE)
          XX*********************************************************************
          XX*
          XX* Stored procedure for executing the COZ Launcher
          XX*
          XX* Tailor the proc your installation:
          XX* 1.) Tailor LIBRARY with the PDSE that contains the
          XX*     COZ load module.
          XX* 2.) Tailor COZCFGD= to point to SAMPJCL config member
          XX*********************************************************************
       18 XXEXCOZ    PROC ARGS=,                       < [-L<log_opt>]
          XX   LIBRARY='SYS2.COZ.LOADLIB',          < STEPLIB FOR COZLNCH
          XX   COZCFGD='SYS2.COZ.SAMPJCL(COZCFGD)', < Installation defaults
          XX   REGSIZE='32M',                          < Execution region size
          XX   LEPARM=''
       19 XXCOZLNCH  EXEC PGM=COZLNCH,REGION=&REGSIZE,
          XX   PARM='&LEPARM/&ARGS'
          IEFC653I SUBSTITUTION JCL - PGM=COZLNCH,REGION=32M,PARM='/test01@135.458.698.542'
       20 XXSTEPLIB  DD DSN=&LIBRARY,DISP=SHR
          IEFC653I SUBSTITUTION JCL - DSN=SYS2.COZ.LOADLIB,DISP=SHR
       21 XXCOZCONF  DD DSN=&COZCFGD,DISP=SHR
          IEFC653I SUBSTITUTION JCL - DSN=SYS2.COZ.SAMPJCL(COZCFGD),DISP=SHR
       22 XX         DD DDNAME=COZCFG
       23 //COZCFG DD *
          X/COZCFG   DD DUMMY                     < override for addtnl config
       24 XXSYSOUT   DD SYSOUT=*                  < System stderr
       25 XXSYSPRINT DD SYSOUT=*,                 < Agent stdout
          XX            DCB=(RECFM=VB,LRECL=255,BLKSIZE=10000)
       26 //STDIN DD *
          X/STDIN    DD DUMMY                     < Target app stdin
       27 //STDOUT DD SYSOUT=*
          X/STDOUT   DD SYSOUT=*,                 < Target app stdout
          X/            DCB=(RECFM=VB,LRECL=255,BLKSIZE=10000)
       28 XXSTDERR   DD SYSOUT=*,                 < Target app stderr
          XX            DCB=(RECFM=VB,LRECL=255,BLKSIZE=10000)
          XX*
       29 XX PEND
          //*
          //*********************************************************************
 STMT NO. MESSAGE
        3 IEFC001I PROCEDURE COZPROC WAS EXPANDED USING PRIVATE LIBRARY SYS2.COZ.SAMPJCL
       17 IEFC001I PROCEDURE COZPROC WAS EXPANDED USING PRIVATE LIBRARY SYS2.COZ.SAMPJCL
ICH70001I AD@XXX   LAST ACCESS AT 09:34:17 ON TUESDAY, MAY 18, 2010
IEF236I ALLOC. FOR AD@XXXCZ COZLNCH CRTEPDF
IEF237I C10C ALLOCATED TO STEPLIB
IEF237I C10C ALLOCATED TO COZCONF
IEF237I JES2 ALLOCATED TO
IEF237I JES2 ALLOCATED TO SYSOUT
IEF237I JES2 ALLOCATED TO SYSPRINT
IEF237I JES2 ALLOCATED TO STDIN
IEF237I JES2 ALLOCATED TO STDOUT
IEF237I JES2 ALLOCATED TO STDERR
IGD103I SMS ALLOCATED TO DDNAME INPUT
IGD103I SMS ALLOCATED TO DDNAME SYS00001
IGD104I HFS FILE WAS RETAINED, DDNAME IS (SYS00001)
FILENAME IS (/etc/resolv.conf)
IEF237I C339 ALLOCATED TO SYS00003
IEF285I   TCPDEVL.HGYU.STANDARD.TCPXLBIN               KEPT
IEF285I   VOL SER NOS= DVLSY7.
IGD103I SMS ALLOCATED TO DDNAME SYS00004
IGD104I HFS FILE WAS RETAINED, DDNAME IS (SYS00004)
FILENAME IS (/etc/resolv.conf)
IEF237I C339 ALLOCATED TO SYS00006
IEF285I   TCPDEVL.HGYU.STANDARD.TCPXLBIN               KEPT
IEF285I   VOL SER NOS= DVLSY7.
BPXP023I THREAD 0DD2C00000000000, IN PROCESS 16777655, WAS
TERMINATED BY SIGNAL SIGKILL, SENT FROM THREAD
0DD2800000000000, IN PROCESS 67109726, UID 203031245, IN JOB AD@XXXCZ.
IEF142I AD@XXXCZ COZLNCH CRTEPDF - STEP WAS EXECUTED - COND CODE 0001
IEF285I   SYS2.COZ.LOADLIB                             KEPT
IEF285I   VOL SER NOS= DVL3P3.
IEF285I   SYS2.COZ.SAMPJCL                             KEPT
IEF285I   VOL SER NOS= DVL3P3.
IEF285I   AD@XXX.AD@XXXCZ.JOB14174.D0000101.?          SYSIN
IEF285I   AD@XXX.AD@XXXCZ.JOB14174.D0000105.?          SYSOUT
IEF285I   AD@XXX.AD@XXXCZ.JOB14174.D0000106.?          SYSOUT
IEF285I   AD@XXX.AD@XXXCZ.JOB14174.D0000102.?          SYSIN
IEF285I   AD@XXX.AD@XXXCZ.JOB14174.D0000107.?          SYSOUT
IEF285I   AD@XXX.AD@XXXCZ.JOB14174.D0000108.?          SYSOUT
IGD104I FILE01.MAY17.AVBG05B.BIG            RETAINED,  DDNAME=INPUT
IEF373I STEP/COZLNCH /START 2010138.0937
IEF374I STEP/COZLNCH /STOP  2010138.0949 CPU    0MIN 05.86SEC SRB    0MIN 00.05SEC VIRT   160K SYS   308K EXT   15672K SYS   13856K
IEF202I AD@XXXCZ COZLNCH FTPFILES - STEP WAS NOT RUN BECAUSE OF CONDITION CODES
IEF272I AD@XXXCZ COZLNCH FTPFILES - STEP WAS NOT EXECUTED.
IEF373I STEP/COZLNCH /START 2010138.0949
IEF374I STEP/COZLNCH /STOP  2010138.0949 CPU    0MIN 00.00SEC SRB    0MIN 00.00SEC VIRT     0K SYS     0K EXT       0K SYS       0K
IEF375I  JOB/AD@XXXCZ/START 2010138.0937
IEF376I  JOB/AD@XXXCZ/STOP  2010138.0949 CPU    0MIN 05.86SEC SRB    0MIN 00.05SEC
CoZLauncher[N]: version: 1.5.3 2010-01-19
CoZLauncher[N]: Copyright (C) Dovetailed Technologies, LLC. 2006. All rights reserved.
cozagent[N]: version: 1.0.5 2008-03-01
fromdsn(DD:STDIN)[N]: 2 records/160 bytes read; 100 bytes written in 0 milliseconds.
cozserver(4)[E]: ./OutputMux.h(131) - write_fully(1, E1A3ED8, 4) - [140] EDC5140I Broken pipe. (errno2=0x74520442)
CEE5213S The signal SIGPIPE was received. 
todsn(DD:STDOUT)[N]: 0 bytes read; 0 records/0 bytes written in 689.830 seconds (0.000 Bytes/sec).
cozagent[E]: Target Program(4454) ended with RC=1
todsn(DD:STDERR)[N]: 152 bytes read; 2 records/150 bytes written in 689.866 seconds (0.220 Bytes/sec).
CoZLauncher[E]: test01@135.458.698.542 target command '<default shell>' ended with RC=1
CoZAgent: admin01@javaweb1 target program '/bin/bash' PID: 4454
CoZAgent: completed with RC=1
fromdsn-client[E]: handleCmdIO: error: EOF after receiving partial packet from CoZServer
fromdsn-client[E]: Error: no exit code received from CoZServer 
dovetail
Site Admin
Posts: 2022
Joined: Thu Jul 29, 2004 12:12 pm

Post by dovetail »

Is this a slow link? Its taking 11 minutes to run, for transferring and processing 10K records... this seems very slow.

One thing you might try is to eliminate the Java program from the equation:

fromdsn //DD:INPUT > /dev/null

In your joblog, it look like the socket used for the "DD:INPUT" data is being terminated. (This socket is tunneled through the ssh connection.)

You may want to check the ssh/sftp logs on the remote server to see if there is any error information there. If this is a Unix server with OpenSSH, these errors would likely be in the "syslog", depending on the configuration.

We probably need a trace to figure out what is going wrong.

To get traces, change the following lines in your job:

//CRTEPDF EXEC PROC=COZPROC,ARGS='-LT,t test01@135.458.698.542'

//COZCFG DD *
ssh-options=-vvv -F /u/AD@XXX/config
agent-options=-LT,t
target-env-COZ_LOG=T,t


These traces are likely to be large, so you can zip and email the logs to info@dovetail.com
ek1975
Posts: 23
Joined: Tue Sep 30, 2008 10:04 pm

Post by ek1975 »

Thanks dovetail. I have sent you the debug trace reports to info@dovetail.com
zStefan
Posts: 15
Joined: Thu Jun 10, 2010 6:14 am

Post by zStefan »

Hi, Did any reason ever get disovered for this? I am having the exact same symptoms for what really is quite a large dataset. (Tens of gigabytes). With plain SFTP put I don't get a problem, but when I use the fromdsn technique to pipe into gzip or bzip2 on a linux server, I get a broken pipe after various lengths of time. Last attempt was after 2.4 Gb with transmission running steady at about 3.4 Mbytes/sec.
There don't seem to be any indications of problems in any logs at the Linux end. In STDERR at the z/OS end I am seeing
fromdsn-clientÝE¨: handleCmdIO: error: EOF after receiving partial packet from CoZServer
Any help gratefully appreciated.
zStefan
Posts: 15
Joined: Thu Jun 10, 2010 6:14 am

Post by zStefan »

Some more information. After further experiments it looks like some kind of time limit problem. All my fromdsn commands are being terminated after about the same length of time (670 seconds approx), this is independant of the size of the dataset, the speed of the network or the platform that is issuing the fromdsn. Now I'll look around for what is imposing the time limit, but if anyone knows straight away, that would be great.
Thanks.
nge88
Posts: 5
Joined: Mon Feb 22, 2010 6:56 am

Post by nge88 »

zStefan, I experimented the same issue, with a large dataset (near 4GB), and with a near 2 gb Dataset. The key seems to be time; in fact out average connection speed in around 7-8 Mbyte/s, but when for some reasons the job execute more than 670 seconds fromdsn utility end with 102 Return Code. We're experimenting this with a cygwin distribuited client, but i think it's a platform-indipendent issue (or limit).
dovetail
Site Admin
Posts: 2022
Joined: Thu Jul 29, 2004 12:12 pm

Post by dovetail »

This is a problem that seems to occur in some environments, but we have not been able to reproduce it in our test environment.

We looked at traces provided by ek1975, and it looks like the "port forwarding" socket managed by IBM Ported Tools OpenSSH is dropping the connection.

For ek1975, we discussed his setup and determined that he was much better off using the "ssh-tunnel=false" feature, which routes data connections in the clear rather than through the ssh connection, since his connection was over a hiper-socket.

So, if you don't require encyption of your data and can use ssh-tunnel=false, you will not only avoid this issue but get much better throughput with lower CPU utilization.

If you need encryption, this is not an option. We suggest that you open a PMR with IBM against Ported Tools OpenSSH. You can contact us at info@dovetail.com and we will assist with this process.
zStefan
Posts: 15
Joined: Thu Jun 10, 2010 6:14 am

Post by zStefan »

Thanks dovetail, with ssh-tunnel=false, my fromdsn is working OK with large files. This is all within our datacentre, so I should be able to convince my users that this will be OK for now. Meanwhile, I will try the latest version of Open SSH with the recent Ported Tools release first, and if there's still a problem, I'll take up your offer of assistance to raise a problem with IBM.
Thanks again.
dovetail
Site Admin
Posts: 2022
Joined: Thu Jul 29, 2004 12:12 pm

Post by dovetail »

We suspect that there might be an issue with SSH timing out, since although you are sending lots of data on the port forwarder tunnel, you aren't sending anything on the main connection. So we have added a feature to the latest Co:Z Target Toolkit beta (1.0.9) which will send a log message on the main connection every 5 minutes.

If possible, please try your test with Co:Z Launcher 1.7.4 and Target Toolkit 1.0.9, which we released as "Beta" today in a test environment using SSH tunnels and let us know if this is fixed. We also suggest that you might want to try the new version IBM Ported Tools Openssh (1.2) to see if that helps.
zStefan
Posts: 15
Joined: Thu Jun 10, 2010 6:14 am

Post by zStefan »

I've tried the new beta releases of both z/OS and target components, but still have the same problem. I haven't tried latest Ported Tools SSH yet. Will do so soon.
tsdjim
Posts: 64
Joined: Fri May 07, 2010 2:21 am

handleCMDIO on large dataset

Post by tsdjim »

I am having a similar problem for a large file, any resolution so far?

CoZLauncher?N?: version: 1.7.5 2010-12-13
CoZLauncher?N?: Copyright (C) Dovetailed Technologies, LLC. 2006. All rights res
cozagent?N?: version: 1.0.11 2010-10-27
fromdsn(DD:STDIN)?N2 :? records/160 bytes read; 78 bytes written in 0 millisecon
cozagent ?14:10? Waiting for completion...
cozagent ?14:15? Waiting for completion...
cozagent?E?: Target Program(61332) ended with RC=102
todsn(DD:STDOUT)?N0 :? bytes read; 0 records/0 bytes written in 685.269 seconds
todsn(DD:STDERR)?N166 :? bytes read; 2 records/164 bytes written in 685.272 seco
CoZLauncher?E?: yyyyyy@11.11.11.11 target command '<default shell>' ended with RC 102
cozserver(4)?E?: ./OutputMux.h(118) - write_fully(1, 28419ED8, 4) - ?140? EDC5140I Broken pipe. (errno2=0x74520442)
cozserver(4)?E?: ./OutputMux.h(118) - write_fully(1, 2839A300, 5) - ?140? EDC5140I Broken pipe. (errno2=0x76697242)
cozserver(4)?W?: Exception: 103 sending exit code packet(rc=103) to client
CoZAgent:yyyyyyy@11.11.11.11'/bin/bash' PID: 61332
CoZAgent: completed with RC=102
fromdsn-client(61342)?E?: handleCmdIO: error: EOF after receiving partial packet
fromdsn-client(61342)?E?: Error: no exit code received from CoZServer
dovetail
Site Admin
Posts: 2022
Joined: Thu Jul 29, 2004 12:12 pm

Post by dovetail »

Are you using ssh-tunnel=true or false?
What version of IBM Ported Tools OpenSSH are you using?

There are a couple of possible reasons for this kind of problem:

1) Firewalls that drop connections
If you can reproduce the problem, you can often use "Wireshark" or similar tools to find where the connection is being dropped.

2) Less likely, but there was a potential timing problem fixed in the 1.0.12 release of the target system toolkit (available for download on the website).
Post Reply