By
anders pearson
27 Nov 2005
there’s been some controversy lately over at flickr about their policy of not allowing drawings, paintings, illustrations, etc. on the site. well, “not allowing” isn’t quite accurate: they don’t delete the images, they just flag a user’s account as “Not In Public Site Areas” (aka ‘NIPSA’), which means that none of a flagged user’s images will show up in global tag pages or group pools (except to logged in flickr users who are actually signed up to the group). but since flickr’s real draw is the community aspect, being NIPSA’d effectively cuts a user out of the community; they might as well just have their account shut off.
ignoring the issues of whether drawings, paintings, illustrations etc. have any overall negative effect on flickr, whether this is a very “web 2.0” policy for a service often touted as one of the leaders of the whole “web 2.0” scene, whether it’s good business practice to purposefully alienate and frustrate a thriving and enthusiastic sector of their customer-base, and even ignoring the fact that they flag accounts in this way without in any way notifying the users that they’ve been flagged, i think this raises deeper philosophical questions.
reading through the threads in flickr’s forum, the flickr admins seem genuinely astounded that anyone could have an issue with this policy. the common refrain repeated over and over again is that “flickr is a photosharing site. it’s for sharing photos only” so why is anyone surprised?
my natural response is to ask, “well, what is a ‘photo’, anyway?”
what exactly is the magical defining quality that makes one image a ‘photo’ and another a ‘non-photo’ and thus not suitable to be posted on flickr? where is the line?
let’s start with something we can probably all agree on. here’s a photo from my flickr stream. it was taken with my digital camera and uploaded with no processing (aside from flickr resizing it). it might not be a great photo, but it’s pretty typical of what’s on flickr.
<img src="http://static.flickr.com/26/67009686_01b1c4b9ef_m.jpg" width="240" height="180" alt="melon grate" />
now, right off, we run into the issue that this is a digital photo. accepting a digital photo as a photo is a relatively new phenomenon in the photographic community. i don’t think you’d have to look very hard to find some stodgy old gray-beard photographers who would still insist that if it doesn’t involve film and chemicals and a dark room, it’s not “real” photography. but most of us aren’t that snobby so we’ll agree that a digital photo taken with a digital camera is still a “photo”.
next is the issue of PhotoShop. this is where things get murky real fast. most professional photographers use photoshop or some other image editing software to post-process their photos. cropping them, adjusting the contrast, removing red-eye. these are all common operations and most people wouldn’t revoke an image’s status as a “photo” because of them. of course, once again, you can also find communities of digital photographers who shun the use of photoshop and insist that it only counts if the image is left exactly as the camera recorded it. every profession or hobby has its share of cranks.
but how much can you really get away with? can i crop a photo down to one pixel and still have it be a ‘photo’? if the answer is that yes, it’s still a ‘photo’, then what about another image which consists of a single pixel of the same color except that it was created entirely digitally without light ever being reflected off an object, passing through a lens and onto a photo-sensitive surface? the resulting image files will be exactly identical so how could one justify a difference?
here are two images, one the result of photoshopping a digital camera photo (actually using the Gimp, not PhotoShop, but same diff) and the other an immaculate digital creation. can you tell which is which?
<img src="http://static.flickr.com/35/67717664_2d87d2d5d2_o.jpg" width="200" height="200" alt="black1" /><img src="http://static.flickr.com/24/67717665_8917870683_o.jpg" width="200" height="200" alt="black2" />
if there is some point at which manipulating a photo makes it no longer a photo, where exactly is that point? does cropping a photo down to less than 11% of its original size change its nature while 12% is ok? does it depend on what the subject matter of the photo is?
if i take this photo of a painting:
<img src="http://static.flickr.com/34/67719740_22f8712214_m.jpg" width="240" height="180" alt="imgp3769" />
and crop it down and clean it up into this:
<img src="http://static.flickr.com/30/42430283_a7b597830e_m.jpg" width="240" height="185" alt="roots1" />
does that make it no longer a photo? or is any photo with a painting in it at all, not a “photo”? even if it’s not the focus of the picture?
<img src="http://static.flickr.com/27/38953056_dff09f669e_m.jpg" width="240" height="180" alt="Buenos Aires 2005 - lani, sveta, eduardo’s apt" />
what percentage of the image is the painting allowed to take up and still be considered a “photo”? what if someone uses photoshop to composite several images together like this?
is that still a photo? is it only because all of the sub-images pass as “photos”? what if one out of the four were a “non-photo”? two out of four? where’s the cutoff?
personally, i would call it a “photo of a painting”. actually, i would probably just call everything an “image” and leave it at that. but these are the kinds of photos that flickr has decided are not photos.
anyone who’s seriously tried to take decent photographs of paintings or drawings also knows that it’s not a trivial task. actually, it’s a royal pain in the ass to get it to come out right and requires some real photographic skills like an understanding of lighting and focus and depth-of-field issues. that’s why my photos of paintings suck; i’m not a very good photographer.
ok, i don’t think it’s too much of a stretch to argue that a photo of a painting or drawing is still a ‘photo’. what about scanned drawings and illustrations? does ‘photo’ mean that at some point in its life, light must have passed into some device that we label a “camera”? a scanner functions very similarly to a camera, but i guess you could argue that it’s different enough that it doesn’t really count as a “camera” and thus images that it creates aren’t “photos”.
so, how about this “photo” taken with a 35mm film camera and scanned in?
<img src="http://static.flickr.com/21/27466986_0f1c09a9d5_m.jpg" width="174" height="240" alt="001" />
does passing through a scanner strip it of its “photo” nature? how about this photo of a painting taken with a 35mm film camera and then scanned in:
<img src="http://static.flickr.com/22/27387448_456ac24f02_m.jpg" width="172" height="240" alt="wreck" />
if a “photo” can only come from a camera, what about photograms, the staple of introductory photography classes? are they “photos”?
what about scanner photography? are scans of flowers ok but scans of pieces of paper with ink on them not?
clearly, i think this whole business is absurd, arbitrary, and petty. i think flickr should lighten up, remove the non-photo related NIPSA flags from accounts and promise never to do it again. flickr happens to be a great tool for sharing drawings, paintings, and illustrations whether they’re “photos” or not and i think they would do well to embrace that rather than start punishing their customers for using the service in a way that they hadn’t thought of.
By
anders pearson
01 Nov 2005
Unicode is a wonderful thing. it is also occasionally the bane of my existence.
Joel Spolsky has a classic article on The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) that covers the basics nicely. he doesn’t go much into the specifics of dealing with unicode issues in any particular programming language or platform though.
Python does a decent job of making it possible to write applications that are unicode aware. There are some decent pages out there that cover the basics of python and unicode. it’s not very hard. python has two different kinds of internal representations of strings, unicode strings and 8-bit non-unicode strings (basically ASCII). all of python’s built-in functionality and core libraries will work with either just fine. you can mix and match them without having to pay much attention to what kind of string you have. it only gets tricky when python has to deal with an outside system, like I/O, network sockets, or databases. unfortunately, that’s pretty often and the bugs that pop up can be maddening to track down and fix.
the usual scenario is that you build your application and test it and everything works fine. then you release it to the world and the first user who comes along copies and pastes in some text from MS Word with weird “smart” quotes and assorted non-ASCII junk or tries to write in chinese and your precious application chokes and gurgles and starts spitting up arcane UnicodeDecodeError messages all over the users. then you get to spend some quality time with a pile of tracebacks trying to figure out where in your code (or the code of a library you’re using) something isn’t getting encoded properly. half the time, fixing the bug that cropped up creates another, more subtle unicode related bug somewhere else. just a fun time all around.
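to make that failure mode concrete, here’s a minimal interactive session (just an illustration, not from any particular app) mixing a unicode string with utf8-encoded bytes:

>>> pasted = u"\u201d".encode('utf8')  # utf8 bytes, eg a "smart" quote pasted into a form
>>> u"title: " + pasted                # python tries to decode the bytes as ascii and fails
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)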
i’ve been on a unicode kick lately at work and spent some time experimenting and getting very familiar with the unicode related quirks of the particular technology stack that i prefer to work with at the moment: cherrypy, SQLObject, PostgreSQL, simpleTAL, and textile. here are my notes on how i got them all to play nicely together wrt unicode.
the basic strategy is that application code should try to deal with unicode strings at all times and only encode and decode when talking to the browser or some component that for some reason can’t handle unicode strings. whenever a string is encoded, it should be encoded as UTF8 (if you’re writing applications that would mostly be used by, eg, chinese speakers, you might want to go with UTF16 or UTF32, but for most of us, UTF8 is all kinds of goodness).
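in code, that boundary rule amounts to a couple of one-liners. a minimal sketch (the helper names here are mine, just to illustrate the convention, not anything from a library):

def from_outside(raw_bytes):
    # decode as soon as bytes come in from a file, socket, or anything non-unicode-aware
    return unicode(raw_bytes, 'utf8')

def to_outside(text):
    # encode at the last possible moment, only for components that can't take unicode
    return text.encode('utf8')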
postgresql
postgresql supports unicode out of the box. however, on gentoo at least, it doesn’t encode databases in UTF8 by default, instead using “SQL_ASCII” or something. i didn’t actually test too much to see what went wrong if you didn’t use a UTF8 encoded database. i would assume that kittens get murdered and the baby jesus cries and all sorts of other horrible things happen. anyway, just remember to create databases with:
% createdb -Eunicode mydatabase
and everything should be fine. converting existing databases isn’t very hard either using iconv. just dump it, convert it, drop the database, recreate it with the right encoding and import:
% pg_dump mydatabase > mydatabase_dump.sql
% iconv -f latin1 -t utf8 mydatabase_dump.sql > mydatabase_dump_utf8.sql
% dropdb mydatabase
% createdb -Eunicode mydatabase
% psql mydatabase -f mydatabase_dump_utf8.sql
cherrypy
cherrypy has encoding and decoding filters that make it a cinch to ensure that the application <-> browser boundary converts everything properly. as long as you have:
cherrypy.config.update({'encodingFilter.on' : True,
                        'encodingFilter.encoding' : 'utf8',
                        'decodingFilter.on' : True})
in the startup, it should do the right thing. all your output will be encoded as UTF8 when it’s sent to the browser, charsets will be set in the headers, and your application will get all its input as nice unicode strings.
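for context, a bare-bones cherrypy 2.1 style app with those filters on looks roughly like this (the Root class and its index method are hypothetical, not from my actual code):

import cherrypy

class Root:
    @cherrypy.expose
    def index(self, name=u"world"):
        # 'name' arrives as a unicode string thanks to the decodingFilter;
        # returning unicode is fine, the encodingFilter sends it out as utf8
        return u"<p>hello, %s</p>" % name

cherrypy.config.update({'encodingFilter.on' : True,
                        'encodingFilter.encoding' : 'utf8',
                        'decodingFilter.on' : True})
cherrypy.root = Root()
cherrypy.server.start()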
SQLObject
SQLObject has the tough job of playing border patrol with the database. for the most part, it just works. it has a UnicodeCol type that makes most operations smooth. so instead of defining a class like:
class Page(SQLObject):
    title = StringCol(length=256)
    body = StringCol()
you do:
class Page(SQLObject):
    title = UnicodeCol(length=256)
    body = UnicodeCol()
and all is well. you can do things like:
>>> p = Page(title=u"\u738b\u83f2",body=u"\u738b\u83f2 is a chinese pop star.")
>>> print p.title.encode('utf8')
unicode goes in, unicode comes out. i did discover a few places, though, where SQLObject wasn’t happy about getting unicode. eg, doing:
>>> results = list(Page.select(Page.q.title == u"\u738b\u83f2"))
Traceback ... etc. big ugly traceback ending in:
  File "/usr/lib/python2.4/site-packages/sqlobject/dbconnection.py", line 295, in _executeRetry
    return cursor.execute(query)
TypeError: argument 1 must be str, not unicode
so you do have to be careful to encode your strings before doing a query like that. ie, this works:
>>> results = list(Page.select(Page.q.title == u"\u738b\u83f2".encode('utf8')))
since it’s just a wrapper around the same functionality, you need to use the same care with alternateID columns and Table.byColumnName() queries. so
>>> u = User.byUsername(username)
is out and
>>> u = User.byUsername(username.encode('utf8'))
is in.
similarly, it doesn’t like unicode for the orderBy parameter:
>>> r = list(Page.select(Page.q.title == "foo", orderBy=u"title"))
gives you another similar error. this only comes up because i frequently do something like:
:::python
# in some cherrypy controller class
@cherrypy.expose
def search(self, q="", order_by="modified"):
    r = Page.select(Page.q.title == q, orderBy=order_by)
    # ... format the results and send them to the browser
now, using the cherrypy decodingFilter, which otherwise makes unicode errors disappear, the order_by that gets sent in from the browser is a unicode string. once again, you’ll need to make sure you encode it as UTF8.
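so in practice, the search method above ends up looking something like this (same hypothetical controller, just with the encodes added):

:::python
# in some cherrypy controller class; q and order_by arrive as unicode
# strings via the decodingFilter, so encode them before SQLObject sees them
@cherrypy.expose
def search(self, q=u"", order_by=u"modified"):
    r = Page.select(Page.q.title == q.encode('utf8'),
                    orderBy=order_by.encode('utf8'))
    # ... format the results and send them to the browser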
lastly, EnumCols don’t get converted automatically:
>>> class Ex(SQLObject):
...     foo = EnumCol(enumValues=['a','b','c'])
...
>>> e = Ex(foo=u"a")
will give the usual TypeError exception. it also appears that you just can’t use unicode in EnumCols at all:
>>> class Ex2(SQLObject):
...     foo = EnumCol(enumValues=[u"a",u"b",u"c"])
...
>>> Ex2.createTable()
will fail right from the start.
i haven’t really done enough research to determine if those issues are bugs in SQLObject, bugs in the python postgres driver (psycopg), bugs in postgresql, or if there are good reasons for them to be the way they are, or if i’m just doing something obviously foolish. either way, they are easily worked around so it’s not that big a deal.
simpleTAL
the basic pattern for how i use simpleTAL with cherrypy is something like:
def tal_template(filename,values):
    from simpletal import simpleTAL, simpleTALES
    import cStringIO
    context = simpleTALES.Context()
    # omitting some stuff i do to set up macros, etc.
    # ...
    for k in values.keys():
        context.addGlobal(k,values[k])
    templatefile = open(filename,'r')
    template = simpleTAL.compileXMLTemplate(templatefile)
    templatefile.close()
    f = cStringIO.StringIO()
    template.expand(context,f)
    return f.getvalue()
this, unfortunately, breaks if it comes across any unicode strings in your context. to fix that, you need to specify an outputEncoding on the expand line:
template.expand(context,f,outputEncoding="utf8")
then, since the cherrypy encodingFilter is going to encode all of our output, i change the last line of the function to return a unicode string:
return unicode(f.getvalue(),'utf8')
and it all comes together nicely.
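for completeness, the tail end of tal_template() with both changes in place:

    f = cStringIO.StringIO()
    template.expand(context,f,outputEncoding="utf8")
    # the cherrypy encodingFilter will encode on the way out, so hand back unicode
    return unicode(f.getvalue(),'utf8')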
textile
textile, i think, tries to be too clever for its own good. unfortunately, if you give it a unicode string with some nice non-ascii characters, you get the dreaded UnicodeDecodeError when it tries to convert it to ascii internally:
>>> from textile import textile
>>> textile(u"\u201d")
... blah blah blah... UnicodeDecodeError
it fares slightly better if you give it a utf8 encoded string:
>>> textile(u"\u201d".encode('utf8'))
'<p>&#226;&#128;&#157;</p>'
except that that’s… wrong. rather than spend too much time trying to figure out what textile’s problem was, i reasoned that since its purpose in life is just to spit out html, there was no harm in letting python convert the non-ascii characters to XML numerical entities before running it through textile:
>>> textile(u"\u201d".encode('ascii','xmlcharrefreplace'))
'<p>&#8221;</p>'
which is correct.
[update: 2005-11-02] as discussed in the comments of a post on Sam Ruby’s blog, numerical entities are, in general, not a very good solution. it’s better than nothing, but ultimately it looks like i or someone else is going to have to fix textile’s unicode support if i really want things done properly.
memcached (bonus!)
once i’d done all this research, it didn’t take me very long to audit one of our applications at work and get fairly confident that it can now handle anything that’s thrown at it (and of course it now has a bunch more unicode related unit tests to make sure it stays that way).
so this evening i decided to do the same audit on the thraxil.org code. going through the above checklist i had it more or less unicode clean in short order. the only thing i missed at first is that the site uses memcached to cache things and memcached doesn’t automatically marshal unicode strings. so a .encode('utf8') in the set_cache() and a unicode(value,'utf8') in the get_cache() were needed before everything was happy again.
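those wrappers are nothing fancy; roughly this (set_cache() and get_cache() are just my own thin helpers around memcache.py, sketched here from memory):

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def set_cache(key, value):
    # memcached wants byte strings, so encode unicode on the way in
    mc.set(key, value.encode('utf8'))

def get_cache(key):
    value = mc.get(key)
    if value is not None:
        # and decode back to unicode on the way out
        value = unicode(value, 'utf8')
    return value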
i’m probably missing something, but that’s basically what’s involved in getting a python web application to handle unicode properly. there are some additional shortcuts that i didn’t mention like setting your global default encoding to ‘utf8’ instead of ‘ascii’ but it doesn’t change much, isn’t safe to rely on, and i think it’s useful to understand the details of what’s going on anyway.
for the record, the exact versions i’m using are: Python 2.4, PostgreSQL 8.0.3, cherrypy 2.1, SQLObject 0.7, simpleTAL 3.13, textile 2.0.10, memcache.py 1.2_tummy5, and psycopg 1.1.15.
By
anders pearson
08 Oct 2005
i’ve been working down my list of stuff that i broke when moving the site to cherrypy and i think i’ve pretty much got it all fixed. if you find something else broken, let me know.
the old engine had a static publishing approach. when you added a post or a comment, it figured out which pages were affected by the change and wrote out new static copies of those files on disk, which apache could then serve without any intense processing. combined with a somewhat byzantine architecture of server side includes, this was quite scalable. the site could handle a pounding from massive amounts of traffic without really breaking a sweat because most of the time, it was just serving up static content.
with cherrypy now, everything is served dynamically, meaning that every time someone visits the frontpage, a whole bunch of python code is run and a bunch of data is pulled out of the database, processed, run through some templates, and sent out to the browser.
this obviously doesn’t scale as well and you may have noticed that page loads were a little slower than before (although, honestly, not as slow as i was expecting them to be). so, have i lost my mind? why would i purposely make the site slower?
my main reason is that by serving pages dynamically, i could drastically simplify the code. the code for calculating which pages were affected by a given update was a huge percentage of the overall code. it made adding any new features or refactoring a daunting task. if the sheer volume of code weren’t enough, any time i made a change to the engine, all the pages on disk essentially needed to be regenerated. i had a little script for that but with thousands of posts and comments in the database, running it would actually take a few hours. so that was another obstacle in the way of making improvements to the site. the overall result was that i let things kind of stagnate for quite a while. with everything generated dynamically, the code is short and clean and any changes i make are instantly reflected with just a browser refresh.
performance with the new code was definitely not as good, but it was actually decent enough to satisfy me for a few days while i finished fixing everything. to put some numbers on it, i did a couple of quick benchmarks, requesting the index page (which is one of the more database intensive pages and, along with the feeds, one of the most heavily trafficked) 100 times with ten concurrent requests (using ab2 -n 100 -c 10). i found that it could serve .69 requests per second when requested remotely (thus, with typical network latency) or .9/sec when requested locally (no network latency, so a better picture of how much actual server load is being caused). not great, but also not as bad as i expected. for comparison, apache serving the old static index gave me 6.8/sec (remote) and 28/sec (local). so the dynamic version was about an order of magnitude slower. not awful, but bad enough that i would need to do something about it.
tonight, once i got everything i could think of fixed, i explored memcached and appreciated its simplicity. it only took me a couple minutes and a couple lines of code to set up memcached caching of the index page, feeds, and user index pages. the result is 6.0/sec (remote) and 85/sec (local), which makes me very happy. the remote requests are clearly limited by the network connection somewhere between my home machine and thraxil.org so there’s nothing i could do to make that any faster. since memcached keeps everything in RAM, it manages to outperform apache serving a file off disk on the local requests. i’ve got a couple more pages that i want to add caching for but i’m resisting the urge to go hogwild caching everything because i know that that’ll get me back to an ugly mess of code to determine which caches need to be expired on a given update.
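the caching itself is just the usual get-or-compute pattern, something like this (render_index() is a hypothetical stand-in for the real code that pulls from the database and runs the templates):

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def index_page():
    # serve from memcached when it's there, otherwise render and store it
    page = mc.get('index_page')
    if page is None:
        page = render_index()  # hypothetical: hits the db and runs the templates
        mc.set('index_page', page)
    return page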
of course, i’m also mulling over the possibility of writing some code to cache based on a dependency graph and making that into a cherrypy filter. if i could do it right, it wouldn’t get in the way. but that’s low on my list of priorities right now.
depending on whether i feel more like painting or coding this weekend, i may crank out a few items from my ‘new features and enhancements’ list.