I have 300+ XML files that look like this:
Mathematics
Geometry
Coordinate Geometry
Plotting Ordered Pairs
// Lots of content
(eof)
It SHOULD read like so:
Mathematics
Geometry
Coordinate Geometry
Plotting Ordered Pairs
// Lots of content
Is there a batch solution to correct this?
Solved
The Saxon XSLT processor lets you specify a directory of source documents on the command line:
java net.sf.saxon.Transform -t -s:inputdir -o:outputdir -xsl:theAbove.xslt
which has the advantage that all the initialisation cost (like compiling the stylesheet) is only incurred once.
The easiest way is to use an XSLT processor (like xsltproc for Linux). You have to surround your input XML with a root element for the XML to be valid.
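The wrapping step can be done for all files at once. The following is only a minimal sketch, assuming a POSIX shell and input files that carry no XML declaration or DOCTYPE (those would have to stay at the very top of the file). The element name `root`, the output directory `wrapped` and the function name `wrap_all` are hypothetical choices for illustration, not part of the original answer.

```shell
#!/bin/sh
# Wrap every *.xml file in the current directory in a <root> element.
# Assumptions (not from the original answer): element name "root",
# output directory "wrapped", no XML declaration in the input files.
wrap_all() {
    mkdir -p wrapped || return 1
    for f in *.xml; do
        [ -f "$f" ] || continue      # the glob matched nothing
        {
            printf '<root>\n'
            cat "$f"
            printf '\n</root>\n'     # leading newline in case the file lacks one
        } > "wrapped/$f"
    done
}

wrap_all
```

The originals are left untouched, so the result can be checked before anything is overwritten.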
Then use this XSLT-1.0 file (theAbove.xslt) for transforming your XML.
Output:
Mathematics
Geometry
Coordinate Geometry
Plotting Ordered Pairs
Call it with
xsltproc theAbove.xslt yourSource.xml
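Since the question is about 300+ files, the single call can be put in a loop. This is a rough sketch, assuming a POSIX shell with xsltproc on the PATH. The function name `transform_all` and the output directory are hypothetical. `-o` is xsltproc's option for writing the result to a file instead of standard output.

```shell
#!/bin/sh
# Apply one stylesheet to every *.xml file in a directory.
# Arguments: stylesheet, input directory, output directory.
# The function name and directory layout are assumptions for illustration.
transform_all() {
    xslt=$1
    indir=$2
    outdir=$3
    command -v xsltproc >/dev/null 2>&1 || {
        echo "xsltproc not found" >&2
        return 127
    }
    mkdir -p "$outdir" || return 1
    for f in "$indir"/*.xml; do
        [ -f "$f" ] || continue      # the glob matched nothing
        # Write each result under the same file name in the output directory.
        xsltproc -o "$outdir/$(basename "$f")" "$xslt" "$f" ||
            echo "failed: $f" >&2
    done
}

# Example call:
# transform_all theAbove.xslt . transformed
```

Unlike the Saxon directory mode, this starts xsltproc once per file, so the stylesheet is compiled again for every document.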
It is possible to use two Perl regular expression replaces, the second one with capturing groups and back-references, to reformat the header area of your XML files from
Mathematics
Geometry
Coordinate Geometry
Plotting Ordered Pairs
// Lots of content
to
Mathematics
Geometry
Coordinate Geometry
Plotting Ordered Pairs
// Lots of content
So the block with lots of content is not unindented, but the other lines come out as wanted when this batch file is run from within the directory containing the *.xml files. The Windows command interpreter cmd.exe does not support Perl regular expression replaces in text files. For that reason JREPL.BAT, written by Dave Benham, is needed in addition. It is a batch file / JScript hybrid that reformats the lines in the XML files using regular expression replaces. JREPL.BAT must be in the same directory as this batch file.
@echo off
if not exist *.xml goto :EOF
if not exist "%~dp0jrepl.bat" goto :EOF
for /F "delims=" %%I in ('dir *.xml /A-D /B') do (
    call "%~dp0jrepl.bat" "^[\t ]*</(?:AREA|SECTION|SUBJECT|TOPIC)>[\t ]*\r?\n" "" /M /F "%%I" /O -
    call "%~dp0jrepl.bat" "^[\t ]*<(AREA|SECTION|SUBJECT|TOPIC)(.*>)[\t ]*(\r?\n)[\t ]*(.* )[\t ]*$" "<$1$2$3 $4$3</$1>" /M /F "%%I" /O -
)
To understand the commands used and how they work, open a command prompt window, execute the following commands, and carefully read all the help pages displayed for each command.
call /?
dir /?
echo /?
for /?
goto /?
if /?
jrepl.bat /?