Automating Data Purges: A scenario using bulkload

By John D. Johnson.

A commercial, upgraded version of the utility presented here is available for a reasonable cost from the author's company, including an updated interface and support for multi-volume databases. Contact jdj@jdjohnson.com or (214) 733-4958.

This article originally appeared in Progressions, a Progress-only magazine published by White Star Software.

It seems that one of the 'Universal Truths' of the MIS world is that we are always 'playing catch-up'. Under such conditions, things which should be included as part of the development process are often neglected or done in a haphazard fashion. One of the most frequently neglected part of an organization's processing is the purging of outdated data. Purges are usually performed in a reactive, rather than proactive, mode - often scheduled in response to a developing shortage of disk space or user complaints about poor performance. Many times this approach results from the fact that purges require 'manual' intervention by the DBA staff during off-hours. The approach described in the following paragraphs was developed to relieve our DBA staff of much of the burden of maintaining the roughly 20 Gb of data distributed among 60 Production databases on 10 Unix servers by automating the process to the point where our Computer Operators can assume much of the responsibility for its execution.

This process employs the dump/reload approach to take advantage of the resulting compaction of the database, along with performance improvements brought about by the physical reorganization of the data and indexes. Consequently, our purge utility consists of two separate, but related, functional pieces - A dump utility and a reload function. The dump utility can be run as any number of concurrent sessions, each dumping different files, but only one reload process is necessary, The core programs are designed to be run against any database on any system without modification. The only programs in this process that must be written specifically for each system are those which deal with specific record selection criteria.

The dump portion of this procedure, started in a Quality-Assurance database (Figure 1), consists of three preparatory programs (out-st.p, list-bld.p, and fd-bld.p), one control program , a generic dump program, and a series of system-specific purge programs. The out-st.p program (Figure 2) provides any entry point into the dump process. It performs as follows:

1. It determines whether or not it is the first copy being executed by checking the ctrl-ind field of the dump-ctrl record. If another copy is already running, control passes to Step 7.

2. User is prompted to update fields in the dump-ctrl record for the path to the database to be dumped, the destination for the output files, the path to the new database, and the directory to be used for the index-rebuild sort files.

3. Connection is made to the target database.

4. The dump-ctrl.ctrl-ind is set to 'y'.

5. list-bld.p is run, which reads the schema information from the target database and creates or resets the appropriate records in the QA database.

6. fd-bld.p is run, which uses the schema information from the target database to generate the bulkload fd files that will be needed to load the new database.

7. Runs the purge control program - purg-ctrl.p.

The purg-ctrl.p program (Figure 3) determines which target database file should be dumped next by reading the load-qa file of the QA database, finding the first record where the dump indicator = 'no' and the link indicator = 'no'. [NOTE: The link indicator field is set to 'yes' for all files which do not get dumped at all or are dumped in conjunction with other files. This has been predetermined by the DBA Group and is keyed through the use of dump-names beginning with 'x'.] For each file, it determines whether or not a specific purge program exists. If it does, that program is executed; otherwise, the control program executes the generic dump program (Figure 4), passing to it the name of the file to be dumped and the name of the output file to be used.

Upon its successful completion, each dump/purge progam outputs data to a file in a 'done' directory. The name of the output file is equivalent to the name of the bulkload 'fd' file which is needed to load the data into the new database. The file contains 1 record, listing the output record counts for the program and the date and time of program completion. The file serves three purposes: 1) its existence in the 'done' directory is the key used by the load process to signify that the data is available for import into the new database. 2) the filename is used by the load process to identify the appropriate bulkload 'fd' file to use for loading this data. 3) The data contained within this file can be used for Quality Assurance purposes by providing a written record of program execution and output count(s).

The load process is controlled by the execution of the Unix shellscript - bulkload.sh. (Figure 5) This script performs as follows:

A. It tests for the existence of data available for loading by checking the pop-ulation of the 'done' directory. If this directory is unpopulated, the script sleeps for 1 minute before checking again.

B. If the directory is populated, the script processes the entries in the directory in FIFO fashion.

C. For each entry, the script first executes the proutil bulkload option, using the name of the directory entry to determine the bulkload 'fd' file to use.

D. It then executes the PROGRESS bi-bld.p program to cause the generation of a .bi file entry. This has been found to necessary in the Pentium environment because the bulkload utility has a tendency to fail in reading the fd file unless it is slowed down by the need to first truncate the .bi file.

E. After building a .bi file for the new database, the script moves the processed key file from the 'done' directory to a 'history' directory. This ensures that the file will not be processed again while retaining it for its QA purposes.

F. The script appends the key filename to a 'loaded' file and checks the number of files which have been loaded. Once the count of the loaded file equals the total necessary for that particular database, the script terminates.

Once the bulkload.sh script has completed its job, it starts the load-qa.p program (Figure 6), which updates the load-qa file with input counts and performs comparisons between the output and input counts to ensure that all pertinent data has been successfully loaded. In the event of Quality-Assurance errors, the operator is provided with information regarding the source of the errors and can take corrective action. If all checks are passed, load-qa.p will build the appropriate proutil command line for performing the index rebuild and execute it.

The automation of such "mundane" tasks as this can go a long way toward maintaining or improving the reputation of an MIS department by maintaining reasonable systems response, limiting downtime, and providing the developers with more time to address more "important" issues.

Figure 1: Quality-Assurance Schema

Database: qa (PROGRESS)
Filename: dump-ctrl
Order Field-Name                       dType Ext Flgs Format          Initial
----- -------------------------------- --------- ---- --------------- ----------
   10 ctrl-ind                         char      i x  X
   20 dbase                            char           x(60)
   30 odest                            char           x(50)
   40 sdir                             char           x(50)
   50 new-db                           char           x(60)

Index Name (* indicates primary)  Unique Field Name                 Seq Asc Abbr
--------------------------------- ------ -------------------------- --- --- ----
ctrl-idx*                         yes    ctrl-ind                   1   yes  no

Filename: load-qa
Order Field-Name                       dType Ext Flgs Format          Initial
----- -------------------------------- --------- ---- --------------- ----------
   10 file-name                        char      i    x(20)
   20 dump-name                        char           x(12)
   30 out-cnt                          deci0          >>>,>>>,>>9     0
   40 in-cnt                           deci0          >>,>>>,>>9      0
   50 dump-ind                         logi           yes/no          no
   60 dump-prog                        char           x(15)
   70 link-ind                         logi           yes/no          no

Index Name (* indicates primary)  Unique Field Name                 Seq Asc Abbr
--------------------------------- ------ -------------------------- --- --- ----
lqa-idx*                          yes    file-name                  1   yes no

Figure 2:out-st.p Program Listing

 /* PROGRAM: out-st.p
    PURPOSE: To control entry into the automated purge process
             and perform the appropriate database connections.
    AUTHOR:   J. D. Johnson                DATE: 08/10/94
    MODIFIED:
    */
 {ctrl/dmp-var.i "NEW"}
 do transaction:
   find first dump-ctrl where dump-ctrl.ctrl-ind = 'x' exclusive
   no-wait no-error.
   if available dump-ctrl then do on error undo, retry:
     update skip(1)
     "Database to be Purged" skip
      dump-ctrl.dbase skip(1)
      "Output destination for dump files"
      dump-ctrl.odest skip(1)
             "Directory for Index-Build Sort" skip
             dump-ctrl.sdir skip(1)
             "Full Path to New Database" skip
             dump-ctrl.new-db
      skip(1)
      with centered no-labels row 3 title "Purge Startup" frame ost1.
 
     if r-index(dump-ctrl.odest, "/") = length(dump-ctrl.odest) then
       assign dump-ctrl.odest =
        substring(dump-ctrl.odest, 1, length(dump-ctrl.odest) - 1).
     connect value(dump-ctrl.dbase) -RO -ld prod no-error.
     if not connected ("prod") then do:
       message " Could not connect to requested database.  Try again. ".
       undo, retry.
     end.
     assign dump-ctrl.ctrl-ind = 'y'.
     status default "Running list-bld.p".
     run ctrl/list-bld.p.
     status default "Running fd-bld.p".
     run ctrl/fd-bld.p.
   end.
 end.
 status default.
 find first dump-ctrl no-lock no-error.
 if not connected ("prod") then do:
   connect value(dump-ctrl.dbase) -RO -ld prod no-error.
   if not connected ("prod") then do:
     message " Could not connect to requested database.  Try again. ".
     undo, return.
   end.
 end.
 assign dest-dir = dump-ctrl.odest.
 run ctrl/purg-ctrl.p.

Figure 3: purg-ctrl.p Program Listing

 
 /* PROGRAM:  ctrl/purge-ctrl.p
     PURPOSE:    This program finds the first available load-qa record that represents
                 a file not having been dumped and executes the appropriate dump     program.
     AUTHOR:    J. D. Johnson                      
     DATE:  07/22/94
     MODIFIED:
 */
 {ctrl/dmp-var.i}
 def var prog-name as char format "x(20)"  
           no-undo.
 main-blk:
 repeat:
   do transaction:
     find next load-qa where link-ind = no exclusive no-wait
no-error.
     if not available load-qa then do:
       if locked load-qa then next main-blk.
       else quit.
     end.
     else do:
       if dump-ind then next main-blk.
       assign dmp-nme = load-qa.dump-name
              load-qa.dump-ind = yes
       lqa-recid = recid(load-qa)
       lqa-dprog = load-qa.dump-prog
       lqa-fname = load-qa.file-name.
       release load-qa.
     end.
   end.
   status default "Dumping " + lqa-fname.
   assign dmp-cnt = 0.
   if lqa-dprog = "" then
     run prog/gendump.p value(lqa-fname).
   else do:
     assign prog-name = "prog/" + lqa-dprog.
     run value(prog-name).
   end.
   do transaction:
     find load-qa where recid(load-qa) = lqa-recid exclusive.
     assign load-qa.out-cnt = dmp-cnt.
   end.
 end.

Figure 4: gendump.p Program Listing

 /*  PROGRAM:  prog/gendump.p
     PURPOSE:    This program takes runtime parameters from
               purge-ctrl.p and dumps the
               appropriate file.
     AUTHOR:     J. D. Johnson                          
     DATE:       07/22/94
     MODIFIED:
 */
 {ctrl/dmp-var.i}
 assign opath = dest-dir +  "/" + dmp-nme + ".d".
 output to value(opath).
 status default "Dumping " + dmp-nme.
 for each {1} no-lock:
   export {1}.
   assign dmp-cnt = dmp-cnt + 1.
   if dmp-cnt modulo 100 = 0 then do:
     pause 1 no-message.
     status default "Dumping " + dmp-nme + ":
" + string(dmp-cnt).
   end.
 end.
 output close.
 assign opath = "/usr1/purge/done/" + dmp-nme.
 output to value(opath).
 export dmp-cnt string(time,"HH:MM:SS") today.
 output close.

Figure 5: bulkload.sh Shellscript Listing

 #  PROGRAM:  bulkload.sh
 #  PURPOSE:  This script checks the /usr1/purge/done directory for the
 #            existence of files indicating that the dump of those database
 #            records has been completed.  If 1 or more files exist, it uses
 #            the name of the oldest file to feed the PROGRESS bulkload
 #            utility to load the appropriate data into the new database.
 #  AUTHOR:   J. D. Johnson               DATE:  07/22/94
 #  MODIFIED:
 #set -x
 while true
 do
 clear
   echo "\nInput the path to the data files\n\t\c"
   read DPATH
   echo "\nInput the name of the database being built:\c"
   read NDB
   echo "\nInput the path to the new database\n\t\c"
   read NPATH
   clear
   echo "You wish to load data from the  " $DPATH"  directory into\n\n\t\c"
   echo ${NPATH}/${NDB} "\n\n\t\t\t\tIs this correct?(Y/N) [.]\b\b\c"
   read ANS
   case $ANS in
   y|Y)
     break
     ;;
   *)
     continue
     ;;
   esac
 done
 echo "\nIs this the first execution of bulkload.sh?:
 [.]\b\b\c"
 read FE
 case $FE in
 y|Y)
   rm /usr1/purge/ctrl/bload.adv /usr1/purge/ctrl/loaded 1>/dev/null 2>/dev/null
   rm /usr1/purge/done/* 1>/dev/null 2>/dev/null
   rm /usr1/purge/ldone/* 1>/dev/null 2>/dev/null
   echo "\nDo you wish to create the 'Recipient' database?:
   [.]\b\b\c"
   read CDB
   case $CDB in
   y|Y)
     prodb ${NPATH}/${NDB} /usr2/ccprog/SCHEM/${NDB}
     ;;
   *)
     ;;
   esac
   ;;
 *)
   ;;
 esac
 clear
 echo "\n\n\n\n\n\n\n\n\n\n\t\t\tBulkload.sh in operation"
 PROPATH=/usr1/purge;export PROPATH
 ADV=/usr1/purge/ctrl/bload.adv
 echo "Bulkload.sh started at" `date` >> $ADV
 TOT=`cat /usr1/purge/ctrl/file.cnt`
 CNT=0
 cd /usr1/purge/done
 while [ $CNT -lt $TOT ]
 do
   DCNT=`ls | wc -l | awk '{print $1}'`
   case $DCNT in
   0)
    sleep 60
    ;;
   *)
    FN=`ls -t | tail -1`
    echo "Loading " $FN >> $ADV
    cd $DPATH
    /usr/dlc/proutil ${NPATH}/${NDB} -C bulkload /usr1/purge/fd/${FN}.fd | tee -a  $ADV
    sleep 5
    /usr/dlc/bpro ${NPATH}/${NDB} -p prog/bi-bld.p
    echo $FN >> /usr1/purge/ctrl/loaded
    cd /usr1/purge/done
    mv $FN ../ldone
    CNT=`wc -l /usr1/purge/ctrl/loaded | awk '{print $1}'`
    sleep 10
    ;;
   esac
 done
 echo "bulkload.sh completed at" `date` >>
$ADV

Figure 6: load-qa.p Program Listing

 /*  PROGRAM: load-qa.p
     PURPOSE: To capture statistics produced by the automated
              purge process and report any errors to the operator.
     AUTHOR:  J. D. Johnson                 DATE:  08/09/94
     MODIFIED:
     */
 def var in-fn as char format "x(20)" extent 5.
 def var in-ct as char format "x(20)" extent 5.
 def var idb-cmd as char format "x(80)".
 def var good-fn as log.
 /* Pull in actual load stats from bulkload.sh execution advertisement */
 input through /usr1/purge/ctrl/load-sum.sh no-echo.
 sum-blk:
 repeat:
   set in-fn[1] in-fn[2].
   if index("1234567890",substring(in-fn[1],1,1))= 0 then do:
     find load-qa where load-qa.file-name = in-fn[1] no-error.
     if not available load-qa then next sum-blk.
     assign good-fn = yes.
   end.
   else
     if good-fn then assign load-qa.in-cnt = dec(in-fn[2])
       good-fn        = no.
 end.
 /* Finally, check that all counts match and report those
that don't. */
 assign good-fn = yes.
 for each load-qa:
   if load-qa.in-cnt ne load-qa.out-cnt then do:
     display load-qa.file-name
      load-qa.dump-name
      load-qa.out-cnt
      load-qa.in-cnt
      with 4 col.
     assign good-fn = no.
   end.
 end.
 if good-fn then do:
   display skip(1)
    "  All Counts Match  "
    skip(1)
    with centered row 10.
   pause 10.
   find first dump-ctrl.
   output to /usr1/purge/ctrl/idx-bld.sh.
   put unformatted "/usr/dlc/proutil " + dump-ctrl.new-db
+ " -i -TB 31 -TM 32 -T " +  dump-ctrl.sdir + "
-C idxbuild all\n".
   output close.
   unix silent /usr1/purge/ctrl/idx-ctrl.sh&.
   quit.
 end.

/* PROGRAM:  dmp-var.i
   PURPOSE:  To define variables used in out-st.p, purg-ctrl.p,
           and gendump.p
   AUTHOR:   J. D. Johnson                         
   DATE:     07/22/94
   MODIFIED:
*/
def {1} shared var dest-dir as char format "x(60)".
def {1} shared var dmp-nme like load-qa.dump-name.
def {1} shared var lqa-recid as recid.
def {1} shared var lqa-dprog like load-qa.dump-prog.
def {1} shared var lqa-fname like load-qa.file-name.
def {1} shared var dmp-cnt like load-qa.out-cnt.
def            var opath as char format "x(50)".

---
John D. Johnson, President                Voice:  (214) 733-4958
J. D. Johnson & Assoc                     Email:  jdj@jdjohnson.com
5730 Buffridge Tr
Dallas, TX 75252