About DTD Grammar

We were busy working out DTDs in CIS 205, and when I pulled up the link to validate a DTD grammar, not only did validome.org not come up, but the website I used to check whether it was up or down hit me with a drive-by hack. Deep Freeze to the rescue!

But that didn't solve the larger issue, and 4 days later, validome.org is still down. PHP to the rescue! I've built this little parser to validate a DTD grammar. It's incomplete, focusing on the items we need for class.

  1. Things it does not do and may never do:
    • Have conditional sections. I didn't even know these were a thing. They are really cool, but beyond the scope of our class.
    • Parse processing instructions intelligently; it allows them anywhere, in any flavor and with any contents (all it validates is the ?> at the end). There are rules, and I am not applying them.
    • Allow UTF-8 characters in identifiers and nmtokens. Just ASCII alphabets in identifiers and nmtokens, thank you.
    • Reference external entities (i.e., pull them in and apply them). This means some parameter entity references remain in place and may cause syntax errors (if used where more than one token is expected, for example).
    • Detect circular references among parameter entity references. PHP will max out its memory and an exception will be triggered instead.
    • Detect proper use of entity references, character entities, and named entities in string literals.
    • Parse PUBLIC and SYSTEM entity rules. Once it hits the keyword, it fast forwards to the end of the rule.
    • Parse NOTATION rules. Once it hits the identifier, it fast forwards to the end of the rule.
    • Parse NOTATION attribute types. Once it hits the identifier, it fast forwards to the end of the rule.
  2. Things it does do:
    • ENTITY rules: checks for legal syntax throughout
    • ATTLIST rules: checks for legal syntax throughout
    • Internal parameter entity references: substitutes them in place. It detects this type of ENTITY rule and maintains a table to do the substitutions. Entities must be defined before their first use.
    • Comments: rejects any -- inside a comment. Now I understand more fully why that's a rule!
    • An attempt at error recovery: if there's an error in a rule, it fast-forwards to the end of the rule and starts again at the next rule.
  3. Things I may fix:
    • Whitespace rules are a little more generous than the standard -- there are rules where the standard only allows \x20 and I am allowing all 4 whitespace characters.
    • I am allowing newlines in more places than I should.
    • Because external entities are not pulled in, there is no error when an entity that was not defined is used. This could be addressed with a lookup-without-substitution.
    • I like the idea of a DTD linter. For example, it could flag keywords like CDATA used in places other than where they belong. This is detectable, but the standard says that when CDATA is used elsewhere, it's just an identifier and is allowed. Not friendly to learners.
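
To make the parameter entity item above more concrete, here is a minimal sketch in Python of a substitution table with a define-before-use rule. It is not the checker's actual PHP code. The names expand_parameter_entities, _substitute, PE_DECL, and PE_REF are hypothetical, and the name pattern is deliberately ASCII-only, matching the limitation listed above. The sketch expands each entity's value when it is defined, so a reference can only ever see earlier definitions. A circular pair is therefore reported as an undefined reference rather than looping.

```python
import re

# Internal parameter entity declaration: <!ENTITY % name "value">
# (ASCII-only names, a simplification assumed for this sketch)
PE_DECL = re.compile(
    r'<!ENTITY\s+%\s+([A-Za-z_:][A-Za-z0-9._:-]*)\s+(["\'])(.*?)\2\s*>',
    re.S,
)
# Parameter entity reference: %name;
PE_REF = re.compile(r'%([A-Za-z_:][A-Za-z0-9._:-]*);')


def _substitute(text, table, errors):
    """Replace each %name; found in text using table; record unknown names."""
    def repl(match):
        name = match.group(1)
        if name not in table:
            errors.append(f"undefined parameter entity: %{name};")
            return match.group(0)  # leave the reference in place
        return table[name]
    return PE_REF.sub(repl, text)


def expand_parameter_entities(dtd):
    """Return (expanded_text, errors) for a DTD given as a string."""
    table = {}   # entity name -> already-expanded replacement text
    errors = []
    out = []
    pos = 0
    for decl in PE_DECL.finditer(dtd):
        # Expand references in the rules that precede this declaration.
        out.append(_substitute(dtd[pos:decl.start()], table, errors))
        name, value = decl.group(1), decl.group(3)
        # Expand the value now, so only earlier entities are visible.
        # setdefault keeps the first declaration if a name is repeated.
        table.setdefault(name, _substitute(value, table, errors))
        out.append(decl.group(0))  # keep the declaration itself unchanged
        pos = decl.end()
    out.append(_substitute(dtd[pos:], table, errors))
    return "".join(out), errors


if __name__ == "__main__":
    sample = (
        '<!ENTITY % inline "b | i">\n'
        '<!ELEMENT p (#PCDATA | %inline;)*>\n'
        '<!ELEMENT q (%block;)>'
    )
    expanded, problems = expand_parameter_entities(sample)
    print(expanded)
    print(problems)
```

In this sketch the reference to %block; is left in place and reported, which is roughly the lookup-without-substitution idea mentioned in the list above.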

If you run into any issues with this grammar checker, I'd love to hear about them. I've run about 4 sessions' worth of CIS 205 DTDs through it, plus several from the web, and used them to fine-tune it to some degree. Contact me with your feedback at agarripoli-theusualsymbol-olympic-theothersymbol-edu or file an issue in its GitHub project, DTDgrammar. Thank you.

Built with the help of Google, w3, php.net, Stack Overflow, Skeleton (HTML shell with normalize), XAMPP, and Brackets. Maintained on GitHub. Released under an MIT License.

This website does not save any input you provide to it in the form. Once it has parsed your data, the data is not kept on the server. The web server it runs on may track traffic and monitor use of the server; please ask about its status if that is a concern.