By
anders pearson
27 Nov 2005
there’s been some controversy lately over at flickr about their policy of not allowing drawings, paintings, illustrations, etc. on the site. well, “not allowing” isn’t quite accurate: they don’t delete the images, they just flag a user’s account as “Not In Public Site Areas” (aka ‘NIPSA’), which means that none of a flagged user’s images will show up in global tag pages or group pools (except to logged in flickr users who are actually signed up to the group). but since flickr’s real draw is the community aspect, being NIPSA’d effectively cuts a user out of the community; they might as well just have their account shut off.
ignoring the issues of whether drawings, paintings, illustrations etc. have any overall negative effect on flickr, whether this is a very “web 2.0” policy for a service often touted as one of the leaders of the whole “web 2.0” scene, whether it’s good business practice to purposefully alienate and frustrate a thriving and enthusiastic sector of their customer-base, and even ignoring the fact that they flag accounts in this way without in any way notifying the users that they’ve been flagged, i think this raises deeper philosophical questions.
reading through the threads in flickr’s forum, the flickr admins seem genuinely astounded that anyone could have an issue with this policy. the common refrain repeated over and over again is that “flickr is a photosharing site. it’s for sharing photos only” so why is anyone surprised?
my natural response is to ask, “well, what is a ‘photo’, anyway?”
what exactly is the magical defining quality that makes one image a ‘photo’ and another a ‘non-photo’ and thus not suitable to be posted on flickr? where is the line?
let’s start with something we can probably all agree on. here’s a photo from my flickr stream. it was taken with my digital camera and uploaded with no processing (aside from flickr resizing it). it might not be a great photo, but it’s pretty typical of what’s on flickr.
<img src="http://static.flickr.com/26/67009686_01b1c4b9ef_m.jpg" width="240" height="180" alt="melon grate" />
now, right off, we run into the issue that this is a digital photo. accepting a digital photo as a photo is a relatively new phenomenon in the photographic community. i don’t think you’d have to look very hard to find some stodgy old gray-beard photographers who would still insist that if it doesn’t involve film and chemicals and a dark room, it’s not “real” photography. but most of us aren’t that snobby so we’ll agree that a digital photo taken with a digital camera is still a “photo”.
next is the issue of PhotoShop. this is where things get murky real fast. most professional photographers use photoshop or some other image editing software to post-process their photos. cropping them, adjusting the contrast, removing red-eye. these are all common operations and most people wouldn’t revoke an image’s status as a “photo” because of them. of course, once again, you can also find communities of digital photographers who shun the use of photoshop and insist that it only counts if the image is left exactly as the camera recorded it. every profession or hobby has its share of cranks.
but how much can you really get away with? can i crop a photo down to one pixel and still have it be a ‘photo’? if the answer is that yes, it’s still a ‘photo’, then what about another image which consists of a single pixel of the same color except that it was created entirely digitally without light ever being reflected off an object, passing through a lens and onto a photo-sensitive surface? the resulting image files will be exactly identical so how could one justify a difference?
here are two images, one the result of photoshopping a digital camera photo (actually using the Gimp, not PhotoShop, but same diff) and the other an immaculate digital creation. can you tell which is which?
<img src="http://static.flickr.com/35/67717664_2d87d2d5d2_o.jpg" width="200" height="200" alt="black1" /><img src="http://static.flickr.com/24/67717665_8917870683_o.jpg" width="200" height="200" alt="black2" />
if there is some point at which manipulating a photo makes it no longer a photo, where exactly is that point? does cropping a photo down to less than 11% of its original size change its nature while 12% is ok? does it depend on what the subject matter of the photo is?
if i take this photo of a painting:
<img src="http://static.flickr.com/34/67719740_22f8712214_m.jpg" width="240" height="180" alt="imgp3769" />
and crop it down and clean it up into this:
<img src="http://static.flickr.com/30/42430283_a7b597830e_m.jpg" width="240" height="185" alt="roots1" />
does that make it no longer a photo? or is any photo with a painting in it at all, not a “photo”? even if it’s not the focus of the picture?
<img src="http://static.flickr.com/27/38953056_dff09f669e_m.jpg" width="240" height="180" alt="Buenos Aires 2005 - lani, sveta, eduardo’s apt" />
what percentage of the image is the painting allowed to take up and still be considered a “photo”? what if someone uses photoshop to composite several images together like this?
is that still a photo? is it only because all of the sub-images pass as “photos”? what if one out of the four were a “non-photo”? two out of four? where’s the cutoff?
personally, i would call it a “photo of a painting”. actually, i would probably just call everything an “image” and leave it at that. but these are the kinds of photos that flickr has decided are not photos.
anyone who’s seriously tried to take decent photographs of paintings or drawings also knows that it’s not a trivial task. actually, it’s a royal pain in the ass to get it to come out right and requires some real photographic skills like an understanding of lighting and focus and depth-of-field issues. that’s why my photos of paintings suck; i’m not a very good photographer.
ok, i don’t think it’s too much of a stretch to argue that a photo of a painting or drawing is still a ‘photo’. what about scanned drawings and illustrations? does ‘photo’ mean that at some point in its life, light must have passed into some device that we label a “camera”? a scanner functions very similarly to a camera, but i guess you could argue that it’s different enough that it doesn’t really count as a “camera” and thus images that it creates aren’t “photos”.
so, how about this “photo” taken with a 35mm film camera and scanned in?
<img src="http://static.flickr.com/21/27466986_0f1c09a9d5_m.jpg" width="174" height="240" alt="001" />
does passing through a scanner strip it of its “photo” nature? how about this photo of a painting taken with a 35mm film camera and then scanned in:
<img src="http://static.flickr.com/22/27387448_456ac24f02_m.jpg" width="172" height="240" alt="wreck" />
if a “photo” can only come from a camera, what about photograms, the staple of introductory photography classes? are they “photos”?
what about scanner photography? are scans of flowers ok but scans of pieces of paper with ink on them not?
clearly, i think this whole business is absurd, arbitrary, and petty. i think flickr should lighten up, remove the non-photo related NIPSA flags from accounts and promise never to do it again. flickr happens to be a great tool for sharing drawings, paintings, and illustrations whether they’re “photos” or not and i think they would do well to embrace that rather than start punishing their customers for using the service in a way that they hadn’t thought of.
By
anders pearson
01 Nov 2005
Unicode is a wonderful thing. it is also occasionally the bane of my existence.
Joel Spolsky has a classic article on The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) that covers the basics nicely. he doesn’t go much into the specifics of dealing with unicode issues in any particular programming language or platform though.
Python does a decent job of making it possible to write applications that are unicode aware. There are some decent pages out there that cover the basics of python and unicode. it’s not very hard. python has two different kinds of internal representations of strings, unicode strings and 8-bit non-unicode strings (basically ASCII). all of python’s built-in functionality and core libraries will work with either just fine. you can mix and match them without having to pay much attention to what kind of string you have. it only gets tricky when python has to deal with an outside system, like I/O, network sockets, or databases. unfortunately, that’s pretty often and the bugs that pop up can be maddening to track down and fix.
the usual scenario is that you build your application and test it and everything works fine. then you release it to the world and the first user who comes along copies and pastes in some text from MS Word with weird “smart” quotes and assorted non-ASCII junk or tries to write in chinese and your precious application chokes and gurgles and starts spitting up arcane UnicodeDecodeError messages all over the users. then you get to spend some quality time with a pile of tracebacks trying to figure out where in your code (or the code of a library you’re using) something isn’t getting encoded properly. half the time, fixing the bug that cropped up creates another, more subtle unicode related bug somewhere else. just a fun time all around.
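to make that failure mode concrete, here’s a minimal interactive session (just an illustration, not from any particular app) mixing a unicode string with utf8-encoded bytes:

>>> pasted = u"\u201d".encode('utf8')  # utf8 bytes, eg a "smart" quote pasted into a form
>>> u"title: " + pasted                # python tries to decode the bytes as ascii and fails
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)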
i’ve been on a unicode kick lately at work and spent some time experimenting and getting very familiar with the unicode related quirks of the particular technology stack that i prefer to work with at the moment: cherrypy, SQLObject, PostgreSQL, simpleTAL, and textile. here are my notes on how i got them all to play nicely together wrt unicode.
the basic strategy is that application code should try to deal with unicode strings at all times and only encode and decode when talking to the browser or some component that for some reason can’t handle unicode strings. whenever a string is encoded, it should be encoded as UTF8 (if you’re writing applications that would mostly be used by, eg, chinese speakers, you might want to go with UTF16 or UTF32, but for most of us, UTF8 is all kinds of goodness).
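in code, that boundary rule amounts to a couple of one-liners. a minimal sketch (the helper names here are mine, just to illustrate the convention, not anything from a library):

def from_outside(raw_bytes):
    # decode as soon as bytes come in from a file, socket, or anything non-unicode-aware
    return unicode(raw_bytes, 'utf8')

def to_outside(text):
    # encode at the last possible moment, only for components that can't take unicode
    return text.encode('utf8')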
postgresql
postgresql supports unicode out of the box. however, on gentoo at least, it doesn’t encode databases in UTF8 by default, instead using “SQL_ASCII” or something. i didn’t actually test too much to see what went wrong if you didn’t use a UTF8 encoded database. i would assume that kittens get murdered and the baby jesus cries and all sorts of other horrible things happen. anyway, just remember to create databases with:
% createdb -Eunicode mydatabase
and everything should be fine. converting existing databases isn’t very hard either using iconv. just dump it, convert it, drop the database, recreate it with the right encoding and import:
% pg_dump mydatabase > mydatabase_dump.sql
% iconv -f latin1 -t utf8 mydatabase_dump.sql > mydatabase_dump_utf8.sql
% dropdb mydatabase
% createdb -Eunicode mydatabase
% psql mydatabase -f mydatabase_dump_utf8.sql
cherrypy
cherrypy has encoding and decoding filters that make it a cinch to ensure that the application <-> browser boundary converts everything properly. as long as you have:
cherrypy.config.update({'encodingFilter.on' : True,
                        'encodingFilter.encoding' : 'utf8',
                        'decodingFilter.on' : True})
in the startup, it should do the right thing. all your output will be encoded as UTF8 when it’s sent to the browser, charsets will be set in the headers, and your application will get all its input as nice unicode strings.
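for context, a bare-bones cherrypy 2.1 style app with those filters on looks roughly like this (the Root class and its index method are hypothetical, not from my actual code):

import cherrypy

class Root:
    @cherrypy.expose
    def index(self, name=u"world"):
        # 'name' arrives as a unicode string thanks to the decodingFilter;
        # returning unicode is fine, the encodingFilter sends it out as utf8
        return u"<p>hello, %s</p>" % name

cherrypy.config.update({'encodingFilter.on' : True,
                        'encodingFilter.encoding' : 'utf8',
                        'decodingFilter.on' : True})
cherrypy.root = Root()
cherrypy.server.start()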
SQLObject
SQLObject has the tough job of playing border patrol with the database. for the most part, it just works. it has a UnicodeCol type that makes most operations smooth. so instead of defining a class like:
class Page(SQLObject):
    title = StringCol(length=256)
    body = StringCol()
you do:
class Page(SQLObject):
    title = UnicodeCol(length=256)
    body = UnicodeCol()
and all is well. you can do things like:
>>> p = Page(title=u"\u738b\u83f2",body=u"\u738b\u83f2 is a chinese pop star.")
>>> print p.title.encode('utf8')
unicode goes in, unicode comes out. i did discover a few places, though, where SQLObject wasn’t happy about getting unicode. eg, doing:
>>> results = list(Page.select(Page.q.title == u"\u738b\u83f2"))
Traceback ... etc. big ugly traceback ending in:
  File "/usr/lib/python2.4/site-packages/sqlobject/dbconnection.py", line 295, in _executeRetry
    return cursor.execute(query)
TypeError: argument 1 must be str, not unicode
so you do have to be careful to encode your strings before doing a query like that. ie, this works:
>>> results = list(Page.select(Page.q.title == u"\u738b\u83f2".encode('utf8')))
since it’s just a wrapper around the same functionality, you need to use the same care with alternateID columns and Table.byColumnName() queries. so
>>> u = User.byUsername(username)
is out and
>>> u = User.byUsername(username.encode('utf8'))
is in.
similarly, it doesn’t like unicode for the orderBy parameter:
>>> r = list(Page.select(Page.q.title == "foo", orderBy=u"title"))
gives you another similar error. this only comes up because i frequently do something like:
:::python
# in some cherrypy controller class
@cherrypy.expose
def search(self, q="", order_by="modified"):
    r = Page.select(Page.q.title == q, orderBy=order_by)
    # ... format the results and send them to the browser
now, using the cherrypy decodingFilter, which otherwise makes unicode errors disappear, the order_by that gets sent in from the browser is a unicode string. once again, you’ll need to make sure you encode it as UTF8.
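so in practice, the search method above ends up looking something like this (same hypothetical controller, just with the encodes added):

:::python
# in some cherrypy controller class; q and order_by arrive as unicode
# strings via the decodingFilter, so encode them before SQLObject sees them
@cherrypy.expose
def search(self, q=u"", order_by=u"modified"):
    r = Page.select(Page.q.title == q.encode('utf8'),
                    orderBy=order_by.encode('utf8'))
    # ... format the results and send them to the browser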
lastly, EnumCols don’t get converted automatically:
>>> class Ex(SQLObject):
...     foo = EnumCol(enumValues=['a','b','c'])
...
>>> e = Ex(foo=u"a")
will give the usual TypeError exception. it also appears that you just can’t use unicode in EnumCols at all:
>>> class Ex2(SQLObject):
...     foo = EnumCol(enumValues=[u"a",u"b",u"c"])
...
>>> Ex2.createTable()
will fail right from the start.
i haven’t really done enough research to determine if those issues are bugs in SQLObject, bugs in the python postgres driver (psycopg), bugs in postgresql, or if there are good reasons for them to be the way they are, or if i’m just doing something obviously foolish. either way, they are easily worked around so it’s not that big a deal.
simpleTAL
the basic pattern for how i use simpleTAL with cherrypy is something like:
def tal_template(filename,values):
    from simpletal import simpleTAL, simpleTALES
    import cStringIO
    context = simpleTALES.Context()
    # omitting some stuff i do to set up macros, etc.
    # ...
    for k in values.keys():
        context.addGlobal(k,values[k])
    templatefile = open(filename,'r')
    template = simpleTAL.compileXMLTemplate(templatefile)
    templatefile.close()
    f = cStringIO.StringIO()
    template.expand(context,f)
    return f.getvalue()
this, unfortunately, breaks if it comes across any unicode strings in your context. to fix that, you need to specify an outputEncoding on the expand line:
template.expand(context,f,outputEncoding="utf8")
then, since the cherrypy encodingFilter is going to encode all of our output, i change the last line of the function to return a unicode string:
return unicode(f.getvalue(),'utf8')
and it all comes together nicely.
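for completeness, the tail end of tal_template() with both changes in place:

    f = cStringIO.StringIO()
    template.expand(context,f,outputEncoding="utf8")
    # the cherrypy encodingFilter will encode on the way out, so hand back unicode
    return unicode(f.getvalue(),'utf8')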
textile
textile, i think, tries to be too clever for its own good. unfortunately, if you give it a unicode string with some nice non-ascii characters, you get the dreaded UnicodeDecodeError when it tries to convert it to ascii internally:
>>> from textile import textile
>>> textile(u"\u201d")
... blah blah blah... UnicodeDecodeError
it fares slightly better if you give it a utf8 encoded string:
>>> textile(u"\u201d".encode('utf8'))
'<p>&#226;&#128;&#157;</p>'
except that that’s… wrong. rather than spend too much time trying to figure out what textile’s problem was, i reasoned that since its purpose in life is just to spit out html, there was no harm in letting python convert the non-ascii characters to XML numerical entities before running it through textile:
>>> textile(u"\u201d".encode('ascii','xmlcharrefreplace'))
'<p>&#8221;</p>'
which is correct.
[update: 2005-11-02] as discussed in the comments of a post on Sam Ruby’s blog, numerical entities are, in general, not a very good solution. it’s better than nothing, but ultimately it looks like i or someone else is going to have to fix textile’s unicode support if i really want things done properly.
memcached (bonus!)
once i’d done all this research, it didn’t take me very long to audit one of our applications at work and get fairly confident that it can now handle anything that’s thrown at it (and of course it now has a bunch more unicode related unit tests to make sure it stays that way).
so this evening i decided to do the same audit on the thraxil.org code. going through the above checklist i had it more or less unicode clean in short order. the only thing i missed at first is that the site uses memcached to cache things and memcached doesn’t automatically marshal unicode strings. so a .encode('utf8') in the set_cache() and a unicode(value,'utf8') in the get_cache() were needed before everything was happy again.
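those wrappers are nothing fancy; roughly this (set_cache() and get_cache() are just my own thin helpers around memcache.py, sketched here from memory):

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def set_cache(key, value):
    # memcached wants byte strings, so encode unicode on the way in
    mc.set(key, value.encode('utf8'))

def get_cache(key):
    value = mc.get(key)
    if value is not None:
        # and decode back to unicode on the way out
        value = unicode(value, 'utf8')
    return value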
i’m probably missing something, but that’s basically what’s involved in getting a python web application to handle unicode properly. there are some additional shortcuts that i didn’t mention like setting your global default encoding to ‘utf8’ instead of ‘ascii’ but it doesn’t change much, isn’t safe to rely on, and i think it’s useful to understand the details of what’s going on anyway.
for the record, the exact versions i’m using are: Python 2.4, PostgreSQL 8.0.3, cherrypy 2.1, SQLObject 0.7, simpleTAL 3.13, textile 2.0.10, memcache.py 1.2_tummy5, and psycopg 1.1.15.
By
anders pearson
08 Oct 2005
i’ve been working down my list of stuff that i broke when moving the site to cherrypy and i think i’ve pretty much got it all fixed. if you find something else broken, let me know.
the old engine had a static publishing approach. when you added a post or a comment, it figured out which pages were affected by the change and wrote out new static copies of those files on disk, which apache could then serve without any intense processing. combined with a somewhat byzantine architecture of server side includes, this was quite scalable. the site could handle a pounding from massive amounts of traffic without really breaking a sweat because most of the time, it was just serving up static content.
with cherrypy now, everything is served dynamically, meaning that every time someone visits the frontpage, a whole bunch of python code is run and a bunch of data is pulled out of the database, processed, run through some templates, and sent out to the browser.
this obviously doesn’t scale as well and you may have noticed that page loads were a little slower than before (although, honestly, not as slow as i was expecting them to be). so, have i lost my mind? why would i purposely make the site slower?
my main reason is that by serving pages dynamically, i could drastically simplify the code. the code for calculating which pages were affected by a given update was a huge percentage of the overall code. it made adding any new features or refactoring a daunting task. if the sheer volume of code weren’t enough, any time i made a change to the engine, all the pages on disk essentially needed to be regenerated. i had a little script for that but with thousands of posts and comments in the database, running it would actually take a few hours. so that was another obstacle in the way of making improvements to the site. the overall result was that i let things kind of stagnate for quite a while. with everything generated dynamically, the code is short and clean and any changes i make are instantly reflected with just a browser refresh.
performance with the new code was definitely not as good, but it was actually decent enough to satisfy me for a few days while i finished fixing everything. to put some numbers on it, i did a couple of quick benchmarks, requesting the index page (which is one of the more database intensive pages and, along with the feeds, one of the most heavily trafficked) 100 times with ten concurrent requests (using ab2 -n 100 -c 10). i found that it could serve .69 requests per second when requested remotely (thus, with typical network latency) or .9/sec when requested locally (no network latency, so a better picture of how much actual server load is being caused). not great, but also not as bad as i expected. for comparison, apache serving the old static index gave me 6.8/sec (remote) and 28/sec (local). so the dynamic version was about an order of magnitude slower. not awful, but bad enough that i would need to do something about it.
tonight, once i got everything i could think of fixed, i explored memcached and appreciated its simplicity. it only took me a couple minutes and a couple lines of code to set up memcached caching of the index page, feeds, and user index pages. the result is 6.0/sec (remote) and 85/sec (local), which makes me very happy. the remote requests are clearly limited by the network connection somewhere between my home machine and thraxil.org so there’s nothing i could do to make that any faster. since memcached keeps everything in RAM, it manages to outperform apache serving a file off disk on the local requests. i’ve got a couple more pages that i want to add caching for but i’m resisting the urge to go hogwild caching everything because i know that that’ll get me back to an ugly mess of code to determine which caches need to be expired on a given update.
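the caching itself is just the usual get-or-compute pattern, something like this (render_index() is a hypothetical stand-in for the real code that pulls from the database and runs the templates):

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def index_page():
    # serve from memcached when it's there, otherwise render and store it
    page = mc.get('index_page')
    if page is None:
        page = render_index()  # hypothetical: hits the db and runs the templates
        mc.set('index_page', page)
    return page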
of course, i’m also mulling over the possibility of writing some code to cache based on a dependency graph and making that into a cherrypy filter. if i could do it right, it wouldn’t get in the way. but that’s low on my list of priorities right now.
depending on whether i feel more like painting or coding this weekend, i may crank out a few items from my ‘new features and enhancements’ list.