I'd ask for code review too, but trust me, only masochists need apply.

I've hacked together a library of CPU-specific functions. It's small right now, consisting only of special versions of memcpy and memset, but hopefully it will set the foundation for things to come. The software architecture is a combination of the best ideas from this thread. At the level of Fastlib.cpp and 'higher' it's quite nice, with fully granular yet OO support for basically every CPU out there (the included memory routines come in 6 preprocessor-packaged flavors), but that comes at a cost: I would strongly advise against looking at Cpulib.h while operating heavy machinery.

Of course, nobody will care unless there's worthwhile performance to be gained by this approach. That's where you come in. I've written a bunch of code for a wide variety of CPUs, but I can only test on one architecture (Athlon 1.4 / DDR). P4 users are especially wanted: for obvious reasons, I don't even know whether the SSE2 codepaths work at all. All varieties of testing are very welcome, though; we'll soon find out if I know as much about cache layouts as I think I do.

Basically, I need you to verify that the routine earmarked for your CPU is in fact the fastest one that you can run without generating 'illegal instruction'. I.e., on my machine Athlon > P3 > K6 > P2, but on a P3 it should be the reverse.

Instructions: extract the EXE file. Run the EXE file. Choose an option. Repeat a couple of times for each option so you know which results are anomalous and can be ignored. An OS preemption is relatively common: it happens most often (to the point of unavoidability) in the big tests, but skews the data the most in the small tests. After a few runs you should be able to establish a norm in each category, with results varying by a few percent at most. Report.
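
For anyone who would rather script the "repeat and ignore the anomalies" step than eyeball it, here is a minimal sketch of the idea. It is not the code in the EXE and not part of the library: `CopyFn`, `copy_bytewise`, `copy_libc`, `median` and `time_copy` are all names made up for illustration, and the two candidates are plain stand-ins for the real CPU-specific routines. It times a copy routine several times and reports the median, so a single preempted run does not skew the figure.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical signature shared by every candidate copy routine.
using CopyFn = void* (*)(void* dst, const void* src, std::size_t n);

// Stand-in candidate: a plain byte loop (not one of the library's routines).
void* copy_bytewise(void* dst, const void* src, std::size_t n) {
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < n; ++i) d[i] = s[i];
    return dst;
}

// Stand-in candidate: the C library's memcpy.
void* copy_libc(void* dst, const void* src, std::size_t n) {
    return std::memcpy(dst, src, n);
}

// Median of a set of timings. An outlier caused by a preempted run
// sorts to the end of the list and does not affect the result.
double median(std::vector<double> v) {
    if (v.empty()) return 0.0;
    std::sort(v.begin(), v.end());
    const std::size_t mid = v.size() / 2;
    return (v.size() % 2 != 0) ? v[mid] : (v[mid - 1] + v[mid]) / 2.0;
}

// Copy `bytes` bytes with `fn`, `runs` times over; return the median
// wall-clock time of one copy, in seconds.
double time_copy(CopyFn fn, std::size_t bytes, int runs) {
    std::vector<unsigned char> src(bytes, 0xAB), dst(bytes, 0);
    std::vector<double> samples;
    volatile unsigned char sink = 0;  // read the result so the copy is not optimized away
    for (int r = 0; r < runs; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        fn(dst.data(), src.data(), bytes);
        const auto t1 = std::chrono::steady_clock::now();
        if (!dst.empty()) sink = dst.back();
        samples.push_back(std::chrono::duration<double>(t1 - t0).count());
    }
    (void)sink;
    return median(samples);
}
```

Calling something like `time_copy(copy_libc, 1 << 20, 9)` and `time_copy(copy_bytewise, 1 << 20, 9)` and comparing the two numbers is the whole test in miniature. For very small buffers the clock resolution will dominate, so a real harness would loop the copy many times inside each timed sample.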