PARSE! PARSE! O.K.! (Apologies to Dr. Bronner)

7 January 2004

Mark Pilgrim had a sidebar link to an example of why you should use a real HTML parser, instead of a hodgepodge of regular expressions, to strip out unwanted and potentially dangerous tags/attributes from HTML that you receive. This is a topic near and dear to my heart, because:

When I set up Word Pirates, a site that allows anyone to complain about “pirated words” whose meanings are being distorted, I didn’t do anything to filter the HTML that was received from the outside. Days after the site went live, somebody posted a line of Javascript that redirected it to a porn site.
It reminds me of one of my pet peeves…
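For the curious, here is roughly what "use a real HTML parser" can look like. This is only a minimal sketch, written in Python because its standard library ships an HTML parser; the allowlists, the `Sanitizer` class, and the `sanitize` function are all made up for illustration and are not what Word Pirates or Mark Pilgrim actually used. It keeps a short list of harmless tags, drops every attribute it does not recognise, and throws away `<script>` and `<style>` elements along with their contents.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlists; a real site would choose its own.
ALLOWED_TAGS = {"a", "b", "i", "em", "strong", "p", "br"}
ALLOWED_ATTRS = {"a": {"href"}}
SAFE_SCHEMES = ("http://", "https://", "mailto:")
DROP_CONTENT = {"script", "style"}  # discard these elements and their text
VOID_TAGS = {"br"}                  # tags that never get a closing tag


class Sanitizer(HTMLParser):
    """Re-emit only allowlisted tags and attributes; escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS.get(tag, ()):
                continue  # drops onclick, style, and anything else unlisted
            value = (value or "").strip()
            if name == "href" and not value.lower().startswith(SAFE_SCHEMES):
                continue  # drops javascript: and other unexpected schemes
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize(html_text):
    """Return html_text with everything outside the allowlists removed."""
    parser = Sanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Given `<p onclick="evil()">Hi <b>there</b><script>location="http://example.com"</script></p>`, `sanitize` returns `<p>Hi <b>there</b></p>`. The sketch does not rebalance unclosed tags, so treat it as a starting point and not as a vetted security filter.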

In my current job, my most recent project involved a conversion script. One of our products uses a GUI generator tied to a particular database system, using a proprietary language to describe how the input forms on the screen would look and what database fields each screen field was connected to. We want to use a new home-grown GUI system, which is (in theory) database-independent and uses an XML-based language (of course) to convey the same information. My predecessor had written a Perl script to convert from one language to the other, but the script didn’t do the whole job; my mission was to improve the script where possible, run it over a hundred or so forms, and then manually clean up as many of them as my co-worker could throw at me.

The script that landed in my workstation was an inspiring piece of work. When it processed the forms it received as input, it used regular expressions to identify key words and syntactic constructions. Statements in this proprietary language usually ran across multiple lines, so the script resorted to this sort of technique:

    foreach ($line) {
      # ...
      if (/foobar (.*) begin/) {
        $foobar_arg = $1;
      }
      # ...
      if ($foobar_arg) {
        # ...
        $flag = 0 if /foobar end/;
      }
      # ...
    }

(Observe how, if someone had written a form in the original language that said foobar(baz) { instead of foobar (baz) {, the script would become quietly confused and output gibberish.)
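To make the contrast concrete, here is a minimal sketch of the alternative: break the input into tokens first, so that spacing and line breaks stop mattering. It is in Python, not the script's Perl, and the token set (`IDENT`, `PUNCT`) is an assumption of mine, since the proprietary language is not described here; `tokenize` is a hypothetical helper, not part of the original script.

```python
import re

# Hypothetical token set; the real proprietary language is not described in the post.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<IDENT>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<PUNCT>[(){}])
""", re.VERBOSE)


def tokenize(text):
    """Return (kind, value) pairs, discarding whitespace; raise on unknown input."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            # Fail loudly instead of quietly producing gibberish.
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "WS":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


# Both spellings, and a version split across lines, give the same token stream.
assert tokenize("foobar(baz) {") == tokenize("foobar (baz) {") == tokenize("foobar\n(baz)\n{")
```

Regular expressions still appear, but only to recognise one token at a time. Whatever consumes the tokens never sees the whitespace, and input the lexer does not understand raises an error where the line-matching approach would silently carry on.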

Meanwhile, on the output end, I had a two-hundred-and-thirty-line for-loop, containing a two-hundred-line for-loop, which added little bits of text to three different strings as it navigated through a two-dimensional array of the names of keys of multilevel hashes, and then concatenated these strings into an XML file at the very end. Unless, of course, the input had triggered some obscure bug and the script generated invalid XML. And I was proud of myself for refactoring the script until it only had twenty-six global variables. Etc., etc., etc.

This project bore a strange resemblance to one of my first projects at my previous job, where I had to extend a Perl script that took an XML file as input and generated a PostScript page with the same information … using regular expressions to take apart the XML, instead of using the perfectly adequate XML parser that comes free with Perl, so that an extra space or carriage return would throw it into complete confusion.
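As a rough illustration of why the parser is the better tool, here is a small Python sketch (standing in for the Perl original) that reads the same data from two differently spaced XML documents. The `form` and `field` element names and the `read_fields` helper are invented for the example; only `xml.etree.ElementTree` is a real standard-library module.

```python
import xml.etree.ElementTree as ET


def read_fields(xml_text):
    """Map each <field> element's name attribute to its trimmed text."""
    root = ET.fromstring(xml_text)
    return {
        field.get("name"): (field.text or "").strip()
        for field in root.iter("field")
    }


# Hypothetical documents: identical content, very different whitespace.
compact = '<form><field name="title">Invoice</field><field name="total">42</field></form>'
spread_out = """
<form>
  <field
      name = "title" >Invoice</field>
  <field name="total">
    42
  </field>
</form>
"""

# Extra spaces and line breaks make no difference to the parsed result.
assert read_fields(compact) == read_fields(spread_out)
```

If the input is not well-formed, `ET.fromstring` raises `ET.ParseError`, so bad input is reported up front and does not end up as a mangled page.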

You see, about thirty years ago, a bunch of smart people realized that they had better things to do than construct and debug such monstrosities every time they had a new language to interpret, so they invented lexers and parser generators. Every widely-used language has at least a few of these things, free for the taking, that you can use to describe what a language looks like and where to find information in its statements, instead of taking it apart line by line and using regular expressions to perform abominations in the eyes of God and Larry Wall. But hordes of coders out there seem committed to reinventing the flint and steel all by themselves in their own caves, instead of staggering over to the cave that already has a fire going and grunting “Ogg please borrow Zippo lighter.” If I had faith in the efficiency of the free market, I would be comforted by the knowledge that as long as so many others were creating these messes, I could make money cleaning up after them, but … well.

Of course, as XML hype continues to overtake the world, such convoluted and ignorant use of regular expressions will become rarer and rarer. Instead, we can look forward to convoluted and ignorant use of DOM and SAX. Rapture!
