A Performance Puzzle: B-Tree Insertions Are Slow on SSDs

Bradley C. Kuszmaul
MIT and Tokutek

The attached graph shows an experiment running insertions into high-entropy indexes on the four combinations of storage device

- a RAID 10 using rotating disks, or
- a single SSD,

and storage engine

- InnoDB, using a B-tree, or
- the Tokutek Fractal Tree(tm) storage engine,

all running under MySQL.

The graph shows the performance of the iiBench workload. iiBench was developed originally by Tokutek and then converted to Python by Mark Callaghan. The benchmark maintains a single SQL table with an auto-increment primary key and three secondary indexes, each of which is essentially random. The benchmark can measure mixed insertion/deletion/query workloads, but the attached graph shows only the insertion workload (sketched below).

These experiments were performed by a third party who has a reputation for tuning InnoDB properly. (I cannot disclose their name now, but I'll be able to in a month or two.)

On rotating disks, a Fractal Tree system can support many more high-entropy insertions per second than a B-tree can, because it reduces the number of random disk I/Os by one to two orders of magnitude. So it is not surprising that the Fractal Tree system is about 10x faster than the B-tree system on the RAID.

Since SSDs provide many more random I/Os per second than rotating disks, I expected that an SSD would make the B-tree run much faster. According to its data sheet, the SSD, an Intel X25-E, can provide 3,300 random writes per second and 35,000 random reads per second, as well as 170MB/s of sequential write bandwidth and 250MB/s of sequential read bandwidth. In contrast, the RAID, which has 5 disks, can sustain about 500 random I/Os per second and 500MB/s of bandwidth. Surprisingly, the SSD increased the B-tree performance by only about 50%.

The SSD improved the Fractal Tree system's performance by only about 10%, which is easy to explain: that system is CPU bound, so speeding up the disk system doesn't make much difference.

I don't understand the B-tree's SSD performance, however. The B-tree is not CPU bound. Its terminal insertion rate is 2,300 rows/s. With 28-byte rows (perhaps 100 bytes per row counting all the indexes), the B-tree is writing only about 230KB/s to disk, so there remains plenty of unused bandwidth. InnoDB employs an insertion buffer, which reduces the actual number of I/Os per second by about a factor of 2 to 3 for this workload. At 2,300 rows/s, that's 6,900 insertions per second into the three random indexes, which the insertion buffer reduces to about 2,000 disk I/Os per second. It seems that the system should still have plenty of unused random-I/O capacity; the arithmetic is worked out below. For some reason, the numbers from the data sheet do not match the performance for this kind of workload. Perhaps the data sheet is wrong, or this experiment is flawed, or my analysis uses the wrong performance model for SSDs. I plan to investigate further.

Today (April 2009), a RAID of rotating disks costs about the same as a single SSD but provides much more storage space. The X25-E costs about $430 and provides 32GB of storage. In contrast, a 1.5TB rotating disk costs about $150 and a 250GB drive costs about $60. A 5-disk RAID thus provides about 50 times the storage space, and nearly the same iiBench insertion performance, for about the same cost; the comparison is worked out below. Looking forward, the disk-drive manufacturers are predicting a 30-45% per-year growth rate in capacity per rotating drive, which seems likely to match or outpace the increase in SSD density. This generation of SSD is still not a silver bullet that can solve database performance problems.
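To make the workload concrete, here is a minimal sketch of an iiBench-style table and insertion loop: an auto-increment primary key that is inserted sequentially, plus three secondary indexes that each see essentially random insertions. The table name, column names, connection parameters, batch size, and the use of the mysql-connector-python driver are illustrative assumptions of mine, not iiBench's actual code.

    # Illustrative sketch of an iiBench-style insertion workload (not the real iiBench).
    # Assumes a local MySQL server and the mysql-connector-python package.
    import random

    import mysql.connector  # assumed driver; the real iiBench uses its own DB layer

    SCHEMA = """
    CREATE TABLE IF NOT EXISTS iibench_sketch (
        id     BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,  -- sequential inserts
        a      INT NOT NULL,      -- high-entropy (random) column
        b      INT NOT NULL,      -- high-entropy (random) column
        c      INT NOT NULL,      -- high-entropy (random) column
        filler VARCHAR(32),
        INDEX idx_a (a),          -- each secondary index sees random insertions
        INDEX idx_b (b),
        INDEX idx_c (c)
    ) ENGINE=InnoDB               -- swap the ENGINE clause to compare storage engines
    """

    BATCH = 1000  # rows per multi-row insert; an assumed, typical batch size

    def run(total_rows: int = 1_000_000) -> None:
        conn = mysql.connector.connect(host="localhost", user="bench",
                                       password="bench", database="test")
        cur = conn.cursor()
        cur.execute(SCHEMA)
        stmt = ("INSERT INTO iibench_sketch (a, b, c, filler) "
                "VALUES (%s, %s, %s, %s)")
        inserted = 0
        while inserted < total_rows:
            rows = [(random.randint(0, 2**31 - 1),
                     random.randint(0, 2**31 - 1),
                     random.randint(0, 2**31 - 1),
                     "x" * 32)
                    for _ in range(BATCH)]
            cur.executemany(stmt, rows)   # primary key grows sequentially; the three
            conn.commit()                 # secondary indexes each take a random insert
            inserted += BATCH

    if __name__ == "__main__":
        run()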
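The back-of-the-envelope analysis above can be written down explicitly. The inputs are the figures quoted in the text (2,300 rows/s, roughly 100 bytes per row, three random indexes, a 2-3x reduction from InnoDB's insertion buffer, and the X25-E data-sheet rates); the model of one random I/O per surviving index insertion is my assumption about B-tree behavior, not a measurement.

    # Back-of-the-envelope model for the B-tree-on-SSD puzzle, using the figures
    # quoted in the text.  One random I/O per surviving index insertion is assumed.
    rows_per_sec       = 2300    # observed terminal B-tree insertion rate
    bytes_per_row      = 100     # ~28-byte rows, perhaps 100 bytes with all indexes
    random_indexes     = 3       # three essentially random secondary indexes
    insert_buffer_gain = 3       # InnoDB insertion buffer cuts I/Os by ~2-3x
    ssd_random_writes  = 3300    # X25-E data sheet: random writes per second
    ssd_write_bw       = 170e6   # X25-E data sheet: sequential write bandwidth (B/s)

    write_bw_used = rows_per_sec * bytes_per_row        # ~230 KB/s
    index_inserts = rows_per_sec * random_indexes       # ~6,900 insertions/s
    disk_ios      = index_inserts / insert_buffer_gain  # ~2,300 I/Os/s (text: ~2,000)

    print(f"write bandwidth used : {write_bw_used / 1e3:.0f} KB/s "
          f"({100 * write_bw_used / ssd_write_bw:.1f}% of the data-sheet 170MB/s)")
    print(f"random writes needed : {disk_ios:.0f}/s "
          f"({100 * disk_ios / ssd_random_writes:.0f}% of the data-sheet 3,300/s)")
    # Under this model the data sheet leaves unused write bandwidth and random-write
    # headroom, yet the SSD beat the ~500-I/O/s RAID by only about 50%: the puzzle.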
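The April-2009 price comparison is just arithmetic on the figures quoted above (a $430, 32GB X25-E versus $60 250GB and $150 1.5TB rotating disks); this sketch computes gigabytes per dollar for each option.

    # April-2009 price/capacity comparison, using the figures quoted in the text.
    ssd_price,      ssd_gb      = 430.0, 32.0     # Intel X25-E
    disk_price,     disk_gb     = 60.0,  250.0    # 250GB rotating disk
    big_disk_price, big_disk_gb = 150.0, 1500.0   # 1.5TB rotating disk

    ssd_gb_per_dollar  = ssd_gb / ssd_price       # ~0.07 GB per dollar
    disk_gb_per_dollar = disk_gb / disk_price     # ~4.2 GB per dollar
    print(f"SSD            : {ssd_gb_per_dollar:.2f} GB per dollar")
    print(f"250GB disk     : {disk_gb_per_dollar:.2f} GB per dollar")
    print(f"1.5TB disk     : {big_disk_gb / big_disk_price:.2f} GB per dollar")
    print(f"disk/SSD ratio : {disk_gb_per_dollar / ssd_gb_per_dollar:.0f}x")  # ~56x

The roughly 50x gap in gigabytes per dollar is what underlies the claim that, for about the SSD's price, a small RAID of rotating disks provides far more storage at nearly the same iiBench insertion rate.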
For a given budget, parallel disks are about as good as SSD, and improved data structures provide a much larger bang for the buck. For high-performance transaction systems, rotating disks appear to dominate SSD. This is joint work with Michael A. Bender and Martin Farach-Colton, and was partially supported by NSF, the U.S. Department of Energy, and Tokutek.