Sunday, August 19, 2018

Batch-correct incorrectly formatted XML files

I have 300+ XML files that look like this:


  Mathematics 
  
    Geometry 
    
Coordinate Geometry Plotting Ordered Pairs // Lots of content
(eof)

It SHOULD read like so:


  Mathematics


  Geometry

Coordinate Geometry
Plotting Ordered Pairs // Lots of content

Is there a batch solution to correct this?

Solved

The Saxon XSLT processor allows you to specify a directory containing source documents on the command line:

java net.sf.saxon.Transform -t -s:inputdir -o:outputdir -xsl:theAbove.xslt

which has the advantage that all the initialisation cost (like compiling the stylesheet) is only incurred once.


The easiest way is to use an XSLT processor (like xsltproc on Linux). You have to surround your input XML with a root element for the XML to be well-formed.
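Wrapping each file in a root element could be sketched in Python as follows. This is only an illustration under assumptions: the element name `root`, the helper name `wrap_in_root`, and the sample fragment are hypothetical, and the real files may need a different wrapper.

```python
import re
import xml.etree.ElementTree as ET

# Matches an optional XML declaration at the very start of the text.
_DECL = re.compile(r"\s*<\?xml[^>]*\?>")


def wrap_in_root(text: str, root_name: str = "root") -> str:
    """Wrap an XML fragment in one root element.

    root_name is a hypothetical choice; any name the stylesheet expects works.
    An existing XML declaration is kept in front of the new root element.
    """
    match = _DECL.match(text)
    if match:
        declaration = match.group(0).strip() + "\n"
        body = text[match.end():]
    else:
        declaration = ""
        body = text
    return f"{declaration}<{root_name}>\n{body.strip()}\n</{root_name}>\n"


# Hypothetical fragment with two top-level elements (not well-formed on its own).
fragment = "<SUBJECT>Mathematics</SUBJECT>\n<AREA>Geometry</AREA>\n"
wrapped = wrap_in_root(fragment)

# ET.fromstring raises ParseError if the text is still not well-formed.
root = ET.fromstring(wrapped)
```

Parsing the wrapped text is a cheap check that the file is ready for the XSLT processor before the whole batch is run.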

Then use this XSLT-1.0 file (theAbove.xslt) for transforming your XML.





  

  
    
  

  
    
      
      
    
    
  


Output:


    Mathematics


    Geometry

Coordinate Geometry
Plotting Ordered Pairs

Call it with

xsltproc theAbove.xslt yourSource.xml
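Since xsltproc is called here for one source file, a batch run over 300+ files needs a loop. A minimal sketch in Python, assuming xsltproc is on the PATH and using its `-o` option to name the output file; the function names and the directory names are hypothetical:

```python
import subprocess
from pathlib import Path


def xsltproc_command(stylesheet, source, out_dir):
    """Build the xsltproc argument list for one source file."""
    target = Path(out_dir) / Path(source).name
    return ["xsltproc", "-o", str(target), str(stylesheet), str(source)]


def transform_all(stylesheet, in_dir, out_dir):
    """Run xsltproc once per *.xml file in in_dir, writing into out_dir."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for source in sorted(Path(in_dir).glob("*.xml")):
        # check=True stops the batch at the first file that fails to transform.
        subprocess.run(xsltproc_command(stylesheet, source, out_dir), check=True)


# Example call (directory names are placeholders):
# transform_all("theAbove.xslt", "inputdir", "outputdir")
```

Unlike the Saxon directory mode, this starts a new process and recompiles the stylesheet for every file, which is usually acceptable for a few hundred small documents.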

It is also possible to use two Perl-style regular expression replaces, the second one with capturing groups and back-references, to reformat the header area of your XML files from


  Mathematics 
  
    Geometry 
    
Coordinate Geometry Plotting Ordered Pairs // Lots of content

to


  Mathematics


  Geometry

Coordinate Geometry
Plotting Ordered Pairs // Lots of content

The block with lots of content is not unindented, but the other lines come out as wanted when this batch file is run from within the directory containing the *.xml files. The Windows command interpreter cmd.exe does not support regular expression replaces in text files. For that reason JREPL.BAT, written by Dave Benham, is needed as well. It is a batch file / JScript hybrid that reformats the lines in the XML files using regular expression replaces. JREPL.BAT must be in the same directory as this batch file.

@echo off
if not exist *.xml goto :EOF
if not exist "%~dp0jrepl.bat" goto :EOF

for /F "delims=" %%I in ('dir *.xml /A-D /B') do (
    call "%~dp0jrepl.bat" "^[\t ]*[\t ]*\r?\n" "" /M /F "%%I" /O -
    call "%~dp0jrepl.bat" "^[\t ]*<(AREA|SECTION|SUBJECT|TOPIC)(.*>)[\t ]*(\r?\n)[\t ]*(.*)[\t ]*$" "<$1$2$3  $4$3" /M /F "%%I" /O -
)
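How the capturing groups and back-references of the second replace behave can be tried out in Python. This is a sketch only: it uses the second pattern as printed above with Python's `re` module instead of JScript, and the sample input with bare `<SUBJECT>` and `<AREA>` tags is an assumption about the file layout.

```python
import re

# Second pattern from the batch file; re.MULTILINE plays the role of JREPL's /M.
PATTERN = re.compile(
    r"^[\t ]*<(AREA|SECTION|SUBJECT|TOPIC)(.*>)[\t ]*(\r?\n)[\t ]*(.*)[\t ]*$",
    re.MULTILINE,
)

# Python spelling of the replacement "<$1$2$3  $4$3":
#   \1 = tag name, \2 = rest of the tag, \3 = the line ending, \4 = next line's text
REPLACEMENT = r"<\1\2\3  \4\3"

# Hypothetical input: each tag is followed by one indented line of text.
sample = "<SUBJECT>\n  Mathematics\n  <AREA>\n    Geometry\n"

result = PATTERN.sub(REPLACEMENT, sample)
print(result)
```

Each match covers a tag line plus the line after it. The tag is written back without indentation, the following text is indented by exactly two spaces, and the extra `\3` at the end adds a line ending that leaves a blank line before the next tag.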

To understand the commands used and how they work, open a command prompt window, execute the following commands, and carefully read all the help pages displayed for each command.

  • call /?
  • dir /?
  • echo /?
  • for /?
  • goto /?
  • if /?
  • jrepl.bat /?
