You can't parse XML with regex. Let's do it anyways.

dmi
October 5, 2025 at 03:58 AM

Key Takeaways

  • Parsing XML or HTML with regular expressions is an infamous and widely discouraged programming pitfall.
  • The corporate use of the term "content" is criticized as devaluing specific artworks and writings.
  • XML is significantly more complex than modern data interchange formats like JSON or TOML, leading to security liabilities.
  • The difficulty in fully understanding XML's extensive specification leads inexperienced developers to use inappropriate tools like regex.
  • The author contrasts regex parsing with the proper method, which involves using a stack-based parser to navigate the document tree.
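
The first takeaway can be made concrete with a small sketch. The patterns and sample strings below are illustrative assumptions, not taken from the article; they only show how a regular expression loses track of nesting, which is the usual reason given for calling regex the wrong tool.

```python
import re

# Hypothetical patterns for "the contents of a <div>"; neither is from the article.
LAZY = re.compile(r"<div>(.*?)</div>", re.DOTALL)    # stops at the first </div>
GREEDY = re.compile(r"<div>(.*)</div>", re.DOTALL)   # runs to the last </div>

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

# The lazy pattern closes the outer element at the inner element's end tag.
lazy_nested = LAZY.search(nested).group(1)

# The greedy pattern merges two sibling elements into one match.
greedy_siblings = GREEDY.search(siblings).group(1)

print(lazy_nested)      # outer <div>inner
print(greedy_siblings)  # a</div><div>b
```

Neither pattern can tell which end tag belongs to which start tag, because matching them up requires counting depth, and a plain regular expression has no counter.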

The article opens with a critique of corporate jargon, specifically the use of the word "content" to describe artworks and writings, which the author argues devalues creative output. It then turns to its technical subject: the infamous anti-pattern of parsing structured data such as HTML or XML with regular expressions. The author acknowledges the widely accepted wisdom against this practice and references popular Stack Overflow answers that explain why regex is the wrong tool for the job. From there the text explores XML itself, noting that it is far more complex than formats like JSON or TOML, which a developer can grasp in full. That complexity, spread across a 59-page specification, is a security liability, and it is also why inexperienced developers reach for regex: a classic "you don't know what you don't know" scenario. Finally, the article begins to illustrate how a proper, stack-based parser navigates an XML tree structure.
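
As a rough illustration of the stack-based approach mentioned at the end of the summary, here is a minimal sketch. It is not the article's own code: it assumes Python's standard `xml.parsers.expat` module does the tokenizing, and the function name `element_paths` and the sample document are hypothetical. A stack of open element names tracks the current position in the tree.

```python
import xml.parsers.expat


def element_paths(xml_text):
    """Return the slash-separated path of every element, in document order."""
    stack = []   # names of the currently open elements, root first
    paths = []

    def on_start(name, attrs):
        # Entering an element: push it, then record where we are in the tree.
        stack.append(name)
        paths.append("/" + "/".join(stack))

    def on_end(name):
        # Leaving an element: pop it so the stack reflects the parent again.
        stack.pop()

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.Parse(xml_text, True)   # True marks this as the final chunk
    return paths


print(element_paths("<a><b><c/></b><b/></a>"))
# ['/a', '/a/b', '/a/b/c', '/a/b']
```

The stack is what a regular expression lacks: each start tag pushes, each end tag pops, so the parser always knows which element a given end tag closes, however deep the nesting goes.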
