Thursday March 7, 2002

So far on my programming project (I'm calling it 'posterchild'), I've spent most of my time worrying about the codepages and escaping of specialized markup or punctuation characters. Separating content from design - or designing the separation of the separation. This is disappointing. I wish I was designing new ways to structure my content and narratives. To provide interesting context for the reader. To make recycling the garbage I've already written easier for anyone. As I tried to mention last week, this is the fault of the current crop of X-related technologies. XML etc. force me to process textual information with tools that are the equivalent of assembly language programming.

Unicode is a pretty dry subject, and it's something that only software engineers in charge of localization have been forced to deal with. But I, as a not-so-humble writer, if I want to do something as simple as have curly quotes in my writing, or make sure a tool I'm using to compose a piece isn't inserting characters that won't display on someone else's computer, have to wallow in the details of Unicode or XSLT syntax.

Do you think this is a common thing for writers to deal with? Hell no. Most of them don't even have the most recent copy of Word. In the old days you had grunts to rework and misinterpret your type and/or handwriting before publishing. Think about what happened to James Joyce: fuckups who can't grok some cutting-edge shit creating entire industries of critical analysis bent on figuring out the real intentions of the writer.

I don't care about internationalization of my writing. Maybe that will matter some day but not now. I'm just doing it for the punctuation and IN-LINE MARKUP. Oh, the place that "standardization" has got us to in regard to in-line markup is one of the best yet! Most XML proponents claim that I, as a writer (because of some 9-year-old girl that's reading my shit in Braille by the fire in a cabin in the woods), don't really want *bold* text, I want <strong> text. They say that bold is just plain wrong. It has no semantics. What bullshit. They say, "It is the job of the author to indicate the meaning and let the rendering device determine the best way to communicate the meaning of the words." That makes sense on some levels but I just want my texts to be text. If you know how to use it, text is rich enough. I'm a professional after all and I put a lot of time into rearranging words.

They (and I mean the SGML and W3C nerds that spec'd this shit out) build the tools assuming I'm some stupid "knowledge worker" writing a software manual who tries to use Word like it's a typewriter. These standards are reactionary. Because, I admit it, there are a lot of dumb people out there, bastardizing poor little web standards. But have you looked at their web pages? They suck. I can immediately spot your basic weak-ass CSS-driven box-model crap. This is the hand-me-down toolset that the poor artist gets from the business world. Some of it is genuinely soul-sucking and evil. Well, I won't give up just because it's hard. Now that I know how much a computer mangles type and that text files have a codec the same way computer video and audio have codecs, I can't ignore it. Encoding schemes are central tools to a digital artist.

I thought XML was the magic high-level text encoding scheme I was waiting for. It does allow a writer to boost the meaning in their words, layering meta information over a narrative, placing hooks into the stream of words that can be used later for undiscovered purposes. Its descriptive power is one level higher than HTML, and I don't have to litter my writing with programming calls to a web browser. (What a revolution!!) Everybody seems to be considering it the preferred archival format because it is text-based and human readable. But XML was the asshole that dragged Unicode and XSLT (the most fucked-up code I've seen since my attempts at 16-bit Windows GUI programming in C) in front of my face. I know of no way to work with Unicode except through programming languages. I suppose you could script a text editor to open all your files and then convert the codepages and re-save them as UTF-whatever the hell. But I doubt it would work.
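
For what it's worth, here is a minimal sketch of that re-save idea in present-day Python 3 (not the 2002-era Python used elsewhere in this entry). The function name reencode_file is made up for illustration, and the Windows-1252 source codepage is an assumption: nothing in the file itself says which codepage it was written in, so you have to know or guess.

```python
from pathlib import Path


def reencode_file(path, source_encoding="cp1252", target_encoding="utf-8"):
    """Re-save one text file in a new encoding and return the decoded text.

    source_encoding is a guess supplied by the caller. A wrong guess can
    decode without any error and silently produce mangled characters.
    """
    raw = Path(path).read_bytes()
    text = raw.decode(source_encoding)
    # Overwrites the file in place, so work on a copy of anything precious.
    Path(path).write_bytes(text.encode(target_encoding))
    return text
```

Looping this over a directory is the easy part. The hard part stays the same: knowing what each file's codepage really was before you touch it.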

XML has its own encoding problems that are !< Unicode, but the Unicode thing was something that I'd never anticipated. Now I have to escape Unicode strings into browser-understandable general entities and I have to escape HTML markup inside of XML markup and I have to escape XML markup inside of XSLT markup and I have to escape XSLT markup inside of Python code. That's all so I can print the apostrophe in the first word of this sentence. There is no intelligence built into the layers. (I should mention that in the present entry, this is not exactly the case: I've decided to do this entry in UTF-8 but only utilizing "lower 128 ASCII" punctuation marks so one level of those escape sequences is reduced. Whew!) In networking these things take care of themselves: Ethernet and IP and Winsock and HTTP use true layer abstraction. The hard stuff is done by driver programmers who manage the transition between layers. But now I'm that driver programmer.

>>> print 'I’ve had it'.encode("html-utf-8")

I&#226;&#128;&#153;ve had it

In text processing, the distinctions I make go from the categorical difference between this being a rant and a review, the semantic difference of a summary paragraph and a body paragraph, down to the difference between how many bytes of computer memory are allocated to represent the RIGHT SINGLE QUOTATION MARK. That html-utf-8 codec listed above is obviously doing some bad byte allocation. It should have been I&#8217;ve had it. But what are you going to do? I downloaded it off some FTP server at Xerox PARC and there was NO documentation and this is the only method I've found for escaping Unicode characters above ASCII 127 into HTML entities.
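
A small sketch of what probably went wrong, in present-day Python 3 rather than the interpreter used above. I can't see inside that html-utf-8 codec, so treating it as "escape every UTF-8 byte separately" is an assumption, but the numbers line up: U+2019 is three bytes in UTF-8 (226, 128, 153). The built-in xmlcharrefreplace error handler escapes the character itself instead.

```python
text = "I\u2019ve had it"  # U+2019 is RIGHT SINGLE QUOTATION MARK

# Escape per character: one code point becomes one numeric reference.
by_code_point = text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# Escape per UTF-8 byte: the three bytes of U+2019 each get a reference.
by_byte = "".join(
    chr(b) if b < 128 else "&#%d;" % b
    for b in text.encode("utf-8")
)

print(by_code_point)  # I&#8217;ve had it
print(by_byte)        # I&#226;&#128;&#153;ve had it
```

The second form is what a browser reads as three separate junk characters, because a numeric reference names a code point, not a byte.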

I guarantee you that nobody is going to do this for me. I don't want to count how many times I've had to remove hard line-breaks from a piece of text because it went through email or some text editor. I don't think most people know that you can do a search and replace on a piece of formatting like an end-of-line in Word, but you can. Not to mention the difference between Unix and Windows line breaks.
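
The same cleanup can be scripted. This is only a sketch in Python 3, with made-up function names, and it assumes that a blank line marks a paragraph break and that every other line break is a hard wrap to be removed. Text that depends on its line breaks (poems, code, lists) would be wrecked by it.

```python
import re


def normalize_newlines(text):
    """Turn Windows (CRLF) and bare CR line endings into Unix LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def unwrap_hard_breaks(text):
    """Join hard-wrapped lines, keeping blank lines as paragraph breaks."""
    text = normalize_newlines(text)
    # Split on blank lines (which may contain stray spaces or tabs).
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside each paragraph, collapse every run of whitespace,
    # including the line breaks, into a single space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

For example, unwrap_hard_breaks("one\r\ntwo\r\n\r\nthree\r\nfour") returns "one two\n\nthree four".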

On a previous attempt at redesigning my website I ran into Cascading Style Sheets and Netscape 4.x and it sent me reeling. I vowed to never go near HTML again and never to trust "rendering devices". Maybe this is why there are no abstraction layers between my rants and memory registers. If I trusted the next level down or up it would be fine, but when you've got shit like HTML and Netscape to work with... well, fuck that. I trust Ethernet. The guys who designed that (my man Richard Johnson!) made it so what you put in one end of the pipe came out the other. Wow, how novel.