Parsing XML with Regular Expressions

2020-04-23 |

#regex

DISCLAIMER: parsing XML with regexes is an extremely bad idea, and you shouldn't do it. This article is half-joke and half about advanced regular expressions techniques. Obligatory link to detailed explanation on StackOverflow.

Someone told me that I can't parse XML with regexes, and I just had to prove them wrong. (Actually it was a couple of years ago and since then a draft of this article was sitting in a dusty folder on my hard drive.)

Without any further introduction, here's the ~~beast~~ nice one-liner consisting of 102 characters:

\s*(?(?=<)<\s*(\w+)(?:\s+[^\s>]+=("|'|)[^\s"'>]+\2)*\s*(\/\s*)?>(?(3)|(?R)<\s*\/\s*\1\s*>)|[^<]*)*\s*

https://regex101.com/r/oPLhLO/2

It can parse usual tags without namespaces, attributes and self-closing tags. You can even feed it books.xml from MSXML SDK examples. (But without declaration-line at the top, XML declaration is one of many things this regex cannot parse.)

Untangling the mess

There're many regular expression engines, but regex above is written for PCRE - Perl Compatible Regular Expressions. And once you hear "Perl" in the name, you probably already know it can do pretty much anything.

To understand how this particular regex works it should be formatted, like code in any normal programming language. PCRE and most other engines have a special flag x to ignore whitespace characters. This gives us an ability to indent and visually group parts of a regex.

\s*
(?(?=<)
  < \s* (\w+)
  (?: \s+ [^\s>]+ = ("|'|) [^\s"'>]+ \2 )*
  \s* (\/\s*)? >
  (?(3)| (?R) <\s* \/ \s* \1 \s*> )
|
  [^<]*
)*
\s*

https://regex101.com/r/oPLhLO/3

This is a little better, but for me it's still hard to keep track of capturing groups. Another feature of most regex engines are named groups. They allow you to give an actual readable name to a group instead of counting opening parenthesis.

Let me tell you a little secret. This is the version that I originally wrote and then reduced to that one-liner. Why? The original looks much more terrifying and hacky, doesn't it?

\s*
(?(?=<)
  (?<opentag>
    < \s*
    (?<tagname>\w+)
    (?<attibute>
      \s+
      (?<attrname>[^\s>]+)
      =
      (?<attrquote>"|'|)
      (?<attrvalue>[^\s"'>]+)
      (\k{attrquote})
    )*
    \s*
    (?<selfclosing>\/\s*)?
    >
  )
  (?(<selfclosing>)|
    (?<children>(?R))
    (?<closetag><\s* \/ \s* \k{tagname} \s*>)
  )
|
  (?<text>[^<]*)
)*
\s*

https://regex101.com/r/oPLhLO/5

Even now when it's more or less readable not everything is clear. But before I can explain what this regex does I'd like to remind you about some features used here.

Named groups

(?<name>...) or (?'name'...) in most engines. Gives you an ability to name your groups with readable names.

Non-capturing groups

(?:...). These groups are not captured. Useful when you need to group something only to apply quantifier to it. And for general readability. Also, because their insides are not captured, regex engines should treat them in a more performant way than capturing ones.

Backreferences

\k{name} will match exactly what named group name have matched. Classic syntax is \1, but I think it's ugly and I wouldn't recommend using it if you have named groups.

For example, <(?<tag>\w+)>[^<]*<\/\k{tag}> will match any xml-like tag with a text inside and a correct corresponding closing tag. https://regex101.com/r/KW0OWh/1

Lookahead and lookbehind

(?=PATTERN) will match zero-width PATTERN. I.e. it will not consume anything, just check if PATTERN is ahead.

For example, (?=.*\d.*\d)(?=.*[A-Z].*[A-Z]).{8,} will match any password which is at least 8 symbols long and has at least 2 digits and 2 capital letters. https://regex101.com/r/zIIn76/1

(?!PATTERN) is called negative lookahead and will match zero-width if PATTERN is not ahead.

There's also lookbehind: (?<=PATTERN) and (?<!PATTERN). It does the same thing but, well, looks behind and not ahead.

Conditionals

(?(IF)THEN|ELSE) will match zero-width IF (again, it will not consume anything) and if it succeeded match THEN and ELSE otherwise. For IF you can use lookahead and lookbehind constructs and groups names to check if they've matched before.

Recursion

(?R) in PCRE will match same regex from current position and then return back. So, yeah, recursion in regular expressions.

Logic behind the braces

Our xml is defined as some tag or some text surrounded with optional whitespace.

\s*
(?(?=<)
  (?<opentag>...)
  ...
|
  (?<text>[^<]*)
)*
\s*

Tag itself can be self-closing or with some children. If it's self-closing then we're done: note there's no THEN branch in (?(<selfclosing>)|.... Otherwise we need to recursively parse children and then a closing tag.

(?<opentag>
  ...
  (?<selfclosing>\/\s*)?
  ...
)
(?(<selfclosing>)|
  (?<children>(?R))
  (?<closetag><\s* \/ \s* \k{tagname} \s*>)
)

All the other stuff in (?<opentag>...) I find self-explanatory.

That's it.

Final thoughts

If it's your first time seeing this stuff, I'm sure you'll have some questions.

Are regular expressions in programming languages (and especially PCRE) really "regular" in terms of Chomksy hierarchy? No, they are not.

What type of grammar they can parse then? It's not clear. They're not Turing complete but they can match some context-free grammars. If you substitute text in cycle (this becomes an unfair game, to be honest) then you can even write a BrainFuck interpreter, which is Turing complete.

Does your regex engine support all these crazy features? Check docs for your engine and Regular-Expressions.info Reference.

Should you parse XML with regexes in production or, really, anywhere? Most definitely not.