Sustaining Validation
June 19, 2004
13 Comments
I don’t want to open a can of worms here, but I thought it might be helpful to share my experience over the last week trying to put in place a better system to keep Asterisk valid. Don’t worry, I won’t bore you with all the details.
As you may know, my main problem with validation on this site has been the comments.
As some of you may also know, after a heated (and ultimately rather silly) discussion over at Mezzoblue about Web standards and validation, two sides (Jacques and myself) decided to try and do something productive about it.
As Dave relates in a post that pretty much sums it up, Jacques has been helping me get my comments under control.
Now, I don’t want to even go there with the whole “is validation important” issue. It’s been done, and I’m not touching that. However, as I’m sure Jacques will attest to, this can be far from “easy” (quickly becoming a bad word for me) for many people. For a variety of reasons.
We tried several different solutions with varying degrees of success. There are so many factors that can come into play with this stuff. Your CMS, your server configuration, your templates: any one of these things can muck up an automated solution to fix invalid markup in comments.
In the end, about a week and a half later, I’ve used a combination of the Amputator plugin for ampersands, Jacques’ MTStripControlChars, a forced comment preview and some good old explanation text. My hope is that this should catch most of the bad code coming in.
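For the curious, the first two pieces of that combination can be pictured with a small Perl sketch. To be clear, this is not the source of either plugin. The `clean_comment` helper is a made-up name, and the exact set of control characters and entity forms it handles is my own assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: NOT the Amputator or MTStripControlChars source,
# only an illustration of the two kinds of cleanup mentioned above.
sub clean_comment {
    my ($text) = @_;

    # Drop control characters that XML 1.0 does not allow, keeping
    # tab (\x09), newline (\x0A) and carriage return (\x0D).
    $text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g;

    # Escape any ampersand that does not already start a named entity
    # or a numeric character reference.
    $text =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/&amp;/g;

    return $text;
}

print clean_comment("Fish & chips &amp; peas\x{07}"), "\n";
```

The bare ampersand gets escaped, the existing `&amp;` is left alone, and the stray bell character disappears. Anything that involves actual markup is beyond a filter like this, which is where the forced preview and the explanation text come in.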
It’s not bulletproof, and we tried, and failed with, several more sophisticated solutions. I ran into immediate problems with Jacques’ preferred solution, MTValidate. This is a pretty complicated solution to begin with and I wouldn’t recommend it for everyone. If you can get it working, I’m sure it’d be great. For my own situation my host can’t/won’t supply the proper Perl modules to get it up.
I’m not going to go to the trouble of switching hosts just to get something like this up.
I also tried a few other plug-in combinations that killed my CGI scripts. There was quite a bit of trial and error on my part, but I eventually got something that will hopefully do the trick. Please let me know if you have any problems with it.
I want to thank Jacques for his help and good humor throughout. It was a bit of a learning experience if nothing else and even though we couldn’t get it all the way there, it’s an improvement.
The bottom line here: validation can be hard to sustain. I’ve pointed this out before and it bears repeating. It doesn’t matter which side you’re on: in the real world, expecting folks who offer things like comments on their site to be 100% valid is unrealistic at this point in time.
With that, I’m off to more important things. In this case, a round of golf with some brand new clubs, some summer sunshine, a few beers, good friends and a wild night out on the town with the boys.
See you Monday.
Filed under: Web Development
Comments
1. Colly said:
It’s pleasing to see developers coming together to find solutions to such issues, but this almost always involves Movable Type solutions. The possible over-reliance on MT was apparent through the recent community outrage at their revised pricing structure.
What about those that use alternative CMS, such as TextPattern, EE/pMachine, or our own applications? Perhaps we need to explore more generic, cross-CMS solutions.
As you suggest, it is unrealistic to expect comments to validate always. But then, who would be pedantic enough to criticise a site where only the comments fail validation? I know, people do…
Posted on June 19, 2004 08:04 AM | #
2. Philippe said:
Have you tried Textile? As implemented in Textpattern, it does a pretty good job of keeping comments clean. If I remember correctly, there is a plugin for MT (it has been ages since I used MT).
Of course, there is no way to know what kind of madness a visitor will try to input, esp. if the audience is not aware of HTML (not the case here I guess).
Posted on June 19, 2004 08:32 AM | #
3. Jacques Distler said:
“As you suggest, it is unrealistic to expect comments to validate always.”
My personal vision of these things is that every web content-creation tool should have validation hooked into the “Preview” function. So, whether you’re previewing a comment on an MT blog, or a page you’re creating in DreamWeaver, it should be checked *automatically* for validity, as well as for whether it “looks” right.
MT is the CMS that I use. And I’m fairly handy in Perl, the language that both MT and the W3C Validator are written in. So that’s where I’ve concentrated my efforts.
A PHP-based solution (for instance) might catch on more widely, but I’m not the one to do it. The main issue, in my eyes, is ensuring that the error-reporting is relatively user-friendly. Even the “new” W3C Validator, which is probably the most user-friendly of the validators out there, can be pretty obscure, at times, in its error reporting.
Having struggled a bit with Keith’s situation (an uncooperative webhost, no shell access, …), it’s clear that we’re pretty far from the point where installing MTValidate is a plug ‘n play operation.
But I’m glad Keith has a more robust (if not exactly “bulletproof”) system going now.
Enjoy the weekend.
Posted on June 19, 2004 08:36 AM | #
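A rough picture of what a check hooked into “Preview” might look like, written as a Perl sketch. It is a crude tag-balance test, far weaker than MTValidate or the W3C Validator, and it is not taken from either. The `preview_errors` name, the short list of empty elements and the wording of the messages are all assumptions; the point is only that problems come back in plain language before the comment is posted.

```perl
use strict;
use warnings;

# Hypothetical preview hook: checks that tags are balanced and properly
# nested. It does NOT validate against a DTD.
my %EMPTY = map { $_ => 1 } qw(br hr img);   # assumed list of empty elements

sub preview_errors {
    my ($text) = @_;
    my (@open, @errors);

    while ($text =~ m{<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>}g) {
        my ($closing, $name) = ($1, lc $2);
        next if $EMPTY{$name};

        if (!$closing) {
            push @open, $name;               # remember the opening tag
        }
        elsif (@open && $open[-1] eq $name) {
            pop @open;                       # properly nested close
        }
        else {
            my $expected = @open ? "</$open[-1]>" : "no closing tag";
            push @errors, "Found </$name> where $expected was expected.";
        }
    }

    # Whatever is still on the stack was never closed.
    push @errors, "The <$_> tag is never closed." for reverse @open;
    return @errors;
}

print "$_\n" for preview_errors('Some <em>emphasis and <b>bold</em> text');
```

An empty list means the preview can go ahead; anything else is shown to the commenter to fix. A real setup would hand the text to a proper validator instead, with the error messages translated into something this readable.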
4. DarkBlue said:
Looking through the comments that are posted on various pages here, I don’t actually see that many posts that are anything other than plain text (YMMV).
In which case, the very easiest solution would be to not allow any kind of markup at all. This is very easy to implement with a little Perl:
# Standardize line endings:
$_[0] =~ s{\r\n}{\n}g; # DOS to Unix
$_[0] =~ s{\r}{\n}g;   # Mac to Unix
# Strip the HTML:
eval {
    # Remove comments, keeping anything else inside a <! ... > declaration:
    $_[0] =~ s{
        <! (.*?) ( -- .*? -- \s* )+ (.*?) >
    }{
        if ($1 || $3) { "<!$1 $3>"; }
    }gesx;
    # Remove every remaining tag, allowing for quoted attribute values:
    $_[0] =~ s{
        < (?: [^>'"] * | ".*?" | '.*?' ) + >
    }{}gsx;
};
# Convert newlines to XHTML:
$_[0] =~ s{\n}{}g;
# Encapsulate the comment:
$_[0] = '<p>' . $_[0] . '</p>';
Note: $_[0] contains the comment text in the above example.
This would guarantee that pages of comments would be valid, whilst reducing functionality only marginally.
Note that the semantics are not perfect with this approach since you would end up with multiple paragraphs within a single “<p>…</p>” pair. Of course, it’s perfectly possible to apply the correct semantics, but the code would be (just a little) more complicated.
Posted on June 19, 2004 09:34 AM | #
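The “correct semantics” mentioned at the end of that comment could be sketched along these lines in Perl. This is an assumption about what the fuller version might look like, not DarkBlue’s code: the `paragraphs` helper is a made-up name, and it expects text that has already had its markup stripped. Blank lines separate paragraphs, and single newlines become line breaks.

```perl
use strict;
use warnings;

# Hypothetical helper: wraps already-stripped comment text in one
# <p>...</p> pair per paragraph instead of one pair for the whole comment.
sub paragraphs {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;                      # DOS and Mac line endings to Unix

    # One block per run of text separated by a blank line.
    my @blocks = grep { /\S/ } split /\n\s*\n/, $text;

    for my $block (@blocks) {
        $block =~ s/^\s+|\s+$//g;               # trim each paragraph
        $block =~ s{\n}{<br />\n}g;             # keep single line breaks
        $block = "<p>$block</p>";
    }
    return join "\n", @blocks;
}

print paragraphs("First line\nsecond line\n\nNew paragraph"), "\n";
```

The sample input comes out as two paragraphs, the first containing one line break.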
5. DarkBlue said:
EDIT:
# Convert newlines to XHTML:
$_[0] =~ s{\n}{}g;
Should read:
# Convert newlines to XHTML:
$_[0] =~ s{\n}{<br />}g;
Posted on June 19, 2004 09:38 AM | #
6. Matt said:
It is admirable that you’re putting the effort in. Congrats and good luck.
Posted on June 19, 2004 09:49 AM | #
7. Richard@Home said:
Ahhh, comments, comment, comments…
I’ve had a real uphill battle trying to ensure the comments on Richard@Home didn’t break my XHTML Strict doctype. My browser of choice is Mozilla, so malformed comments totally break the page. I wish Mozilla handled it like Opera.
I blogged a one-line PHP version of DarkBlue’s code a while ago: Simple Method For Ensuring User Comments Don’t Break Your XHTML Web Site with PHP (I was going for the longest blog article title record that day)
If you allow your users to enter any kind of markup, things get complicated fast.
My current system (which has been running since February with a few minor modifications) runs the comment through PHP’s strip_tags and Luis Argerich’s excellent Class XML Check to validate it. Once PHP5 is out of beta, I’m planning on using PHP5’s inbuilt XML Modules and HTML-Tidy.
(As a minor aside, your preview seems to be inserting an extra blank space at the start of the post each time you preview it)
Posted on June 19, 2004 11:27 PM | #
8. Richard@Home said:
Also worth a look: FCKeditor - haven’t tested it yet, but looks great (and produces XHTML).
Posted on June 19, 2004 11:38 PM | #
9. Anne said:
Keith, keep up the good work! The only thing you might want to consider is to switch to utf-8, which could take some effort, but will make you completely forward compatible.
Richard, that is a very evil method for doing it in PHP. First, you should never use ‘htmlentities’, since it can’t handle multiple byte characters (it actually messes them up). Try to use ‘htmlspecialchars’ at all cost.
Second, you ruin the semantics by removing all the paragraphs and other semantic elements without putting them back. And last, you use BR now where you should have used P instead. You might want to look at http://photomatt.net/scripts/autop instead for doing the conversion.
On the bright side, you *are* using utf-8 (and ‘application/xhtml+xml’ (!)), which all browsers understand (they will not enter windows characters at all) and is forward compatible.
Jacques, validating everything might be nice, but shouldn’t the software just make sure it *is* valid (maybe well-formed is better), without notifying the user (unless the user is entering HTML perhaps)?
Posted on June 20, 2004 12:53 AM | #
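The escaping advice in that comment is given in PHP terms. A Perl counterpart might look like the sketch below, which touches only the four characters that are special in markup and leaves every other character, multibyte ones included, exactly as it arrived. The `escape_markup` name is made up for the example, and the assumption is that the text is already a decoded character string.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';     # print non-ASCII characters as UTF-8

# Escape only the characters that are special in markup.
my %ENTITY = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
);

# Hypothetical helper: everything outside the four characters above
# passes through untouched.
sub escape_markup {
    my ($text) = @_;
    $text =~ s/([&<>"])/$ENTITY{$1}/g;
    return $text;
}

print escape_markup("<b>\"caf\x{e9}\" & bar</b>"), "\n";
```

Because the accented character is never rewritten as a named entity, the result does not depend on the page’s DTD knowing that entity, which fits well with serving pages as utf-8.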
10. Jacques Distler said:
User-friendliness demands that you fix as many of the commonest errors as you can automatically.
But…
Some things are always going to be ambiguous and — particularly if you allow markup — not easily fixable in an automated way. After all, if fixing invalid (or ill-formed) input were as easy as applying a few REGEXPs, we wouldn’t care about well-formedness in the first place. We’d just fix everything on-the-fly.
So I say: start with validation as a baseline and catch-all. Then start adding features to catch and correct the common errors before they get fed to the Validator.
You won’t correct all errors that way. But hopefully, you’ll reduce the number that the user has to deal with to a minimum.
Posted on June 20, 2004 01:41 AM | #
11. Richard@Home said:
In my defence Anne, I posted it as a ‘simple’ method.
The method I use in my blog (as detailed in my first comment) DOES preserve the semantics. The only semantics I strip out are [script]’s and the like.
Posted on June 20, 2004 02:34 AM | #
12. Michael Watts said:
DarkBlue: you’re almost exaggerating with “just a little”, aren’t you? It will essentially only take a nested if statement to fix the semantics, won’t it?
Just what to strip from comments and how to deal with things is quite a complicated issue, one I have been considering quite a lot for the scripting for my site which I’m working on at the moment. I think the system you’re using now deals with things quite well without overcomplicating things for the user…
Posted on June 20, 2004 06:43 PM | #
13. Daryl said:
Regarding the installation of Perl modules, you should be able to dump them into a local directory and just push them onto the @INC array at the beginning of your script. Sounds like you’ve already more or less gotten around your host’s limitations by finding other methods, but for future reference…
Posted on June 21, 2004 05:14 AM | #
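A minimal sketch of that suggestion, assuming the modules have been uploaded to a directory called `extlib` next to the script (the directory name is only an example). It uses the core `lib` pragma rather than a literal `push @INC`, because a plain `push` runs too late to affect `use` lines unless it is wrapped in a `BEGIN` block.

```perl
use strict;
use warnings;
use FindBin qw($Bin);        # directory containing this script
use lib "$Bin/extlib";       # "extlib" is an assumed directory name

# Modules copied into ./extlib can now be loaded with an ordinary
# "use Some::Module;" line placed below this point.
print "Searching: $_\n" for @INC;
```

This only helps with pure-Perl modules. Anything with a compiled component still has to be built for the host’s platform, which may well be part of why a setup without shell access is hard to work with.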
Comments are now closed