I've been playing with Callgrind, Xdebug, and KCachegrind to profile Drupal. I'll go over the details of my results in a future blog entry, but if you want the short story, just check out Rasmus's directions. Rasmus has given lots of talks with names similar to "Getting Rich with PHP5", and I'll leave it to the reader to search for the slides from this talk.

Profiling is an iterative process, and after a while I might blog about it with something more concrete. For now, I have early results.

A lot of time was spent in or below theme(), and much of that ended up in function_exists(), which theme_get_function() can call up to 3 times per call. I drew up a small patch that caches the result of theme_get_function() for the duration of the request.
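To make the idea concrete, here is a minimal sketch of per-request caching using a static array. It is not the attached patch: the names lookup_theme_function() and cached_theme_get_function() are hypothetical, and the order of the three candidate names (theme, then theme engine, then the theme_ default) is an assumption about how the lookup works.

```php
<?php
// Hypothetical stand-in for the uncached lookup: tries up to three candidate
// function names and returns the first one that exists, or FALSE.
function lookup_theme_function($function, $theme = '', $theme_engine = '') {
  $candidates = array();
  if ($theme != '') {
    $candidates[] = $theme . '_' . $function;
  }
  if ($theme_engine != '') {
    $candidates[] = $theme_engine . '_' . $function;
  }
  $candidates[] = 'theme_' . $function;

  foreach ($candidates as $candidate) {
    if (function_exists($candidate)) {
      return $candidate;
    }
  }
  return FALSE;
}

// Hypothetical cached wrapper: the static array survives between calls for
// the rest of the request, so each distinct lookup runs function_exists()
// only the first time. Misses (FALSE) are cached as well.
function cached_theme_get_function($function, $theme = '', $theme_engine = '') {
  static $cache = array();
  $key = $theme . ':' . $theme_engine . ':' . $function;
  if (!array_key_exists($key, $cache)) {
    $cache[$key] = lookup_theme_function($function, $theme, $theme_engine);
  }
  return $cache[$key];
}

// A default implementation so the lookup has something to find.
function theme_example_item() {
  return '<div>item</div>';
}
```

With these definitions, cached_theme_get_function('example_item', 'mytheme', 'phptemplate') falls through the first two candidates and returns 'theme_example_item'; a second call with the same arguments is answered from the array without touching function_exists().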
The patch seemed to give a speed boost of about 1.6% in this benchmark (22.30 to 22.66 fetches/sec):
Without the patch:
paine% http_load -fetches 1000 -parallel 1 urls
1000 fetches, 1 max parallel, 1.1147e+07 bytes, in 44.8517 seconds
11147 mean bytes/connection
22.2957 fetches/sec, 248530 bytes/sec
msecs/connect: 0.178162 mean, 3.503 max, 0.116 min, 0.141622 stddev
msecs/first-response: 42.9944 mean, 69.269 max, 41.211 min, 1.56452 stddev
HTTP response codes:
code 200 -- 1000

With the patch:
paine% http_load -fetches 1000 -parallel 1 urls
1000 fetches, 1 max parallel, 1.1147e+07 bytes, in 44.136 seconds
11147 mean bytes/connection
22.6572 fetches/sec, 252560 bytes/sec
msecs/connect: 0.167902 mean, 0.318 max, 0.106 min, 0.0264368 stddev
msecs/first-response: 42.2773 mean, 48.323 max, 40.858 min, 0.611673 stddev
HTTP response codes:
code 200 -- 1000
Not bad. Not a huge difference, but a measurable one. My next step will be to re-profile and see where else I can save some cycles.
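As a quick check on the size of the effect, the relative change can be worked out from the fetches/sec figures in the two runs above. The variable names here are only illustrative.

```php
<?php
// Figures copied from the two http_load runs above.
$before = 22.2957;  // fetches/sec without the patch
$after  = 22.6572;  // fetches/sec with the patch

// Relative throughput change, as a percentage of the unpatched rate.
$change = ($after - $before) / $before * 100;
printf("%.2f%% more fetches/sec\n", $change);  // prints "1.62% more fetches/sec"
```

The mean first-response times (42.9944 ms versus 42.2773 ms) tell much the same story. Since this is a single run of each configuration, the figure is only a rough indication.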

theme.patch