# Do Not Pass This Way Again Considering MySQL? Use something else. Already on MySQL? Migrate. For every successful project built on MySQL, you could uncover a history of time wasted mitigating MySQL's inadequacies, masked by a hard-won, but meaningless, sense of accomplishment over the effort spent making MySQL behave. Thesis: databases fill roles ranging from pure storage to complex and interesting data processing; MySQL is differently bad at both tasks. Real apps all fall somewhere between these poles, and suffer variably from both sets of MySQL flaws. * MySQL is bad at [storage](#storage). * MySQL is bad at [data processing](#data-processing). ## Storage Storage systems have four properties: 1. Take and store data they receive from applications. 2. Keep that data safe against loss or accidental change. 3. Provide stored data to applications on demand. 4. Give administrators effective management tools. In a truly "pure" storage application, data-comprehension features (constraints and relationships, nontrivial functions and aggregates) would go totally unused. There is a time and a place for this: the return of "NoSQL" storage systems attests to that. Pure storage systems tend to be closely coupled to their "main" application: consider most web/server app databases. "Secondary" clients tend to be read-only (reporting applications, monitoring) or to be utilities in service of the main application (migration tools, documentation tools). If you believe constraints, validity checks, and other comprehension features can be implemented in "the application", you are probably thinking of databases close to this pole. ### Storing Data MySQL has many edge cases which reduce the predictability of its behaviour when storing information. Most of these edge cases are documented, but violate the principle of least surprise (not to mention the expectations of users familiar with other SQL implementations). * Implicit conversions (particularly to and from string types) can modify MySQL's behaviour. * Many implicit conversions are also silent (no warning, no diagnostic), by design, making it more likely developers are entirely unaware of them until one does something surprising. * Conversions that violate basic constraints (range, length) of the output type often coerce data rather than failing. * Sometimes this raises a warning; does your app check for those? * This behaviour is unlike many typed systems (but closely like PHP and remotely like Perl). * Conversion behaviour depends on a per-connection configuration value (`sql_mode`) that has [a large constellation of possible states](http://dev.mysql.com/doc/refman/5.5/en/server-sql-mode.html), making it harder to carry expectations from manual testing over to code or from tool to tool. * MySQL uses non-standard and rather unique interpretations of several common character encodings, including UTF-8 and Latin-1. Implementation details of these encodings within MySQL, such as the `utf8` encoding's MySQL-specific 3-byte limit, tend to leak out into client applications. Data that does not fit MySQL's understanding of the storage encoding will be transformed until it does, by truncation or replacement, by default. ### Preserving Data ... against unexpected changes: like most disk-backed storage systems, MySQL is as reliable as the disks and filesystems its data lives on. MySQL makes very little effort to do its own storage validation and error correction, but this is a limitation shared with many, _many_ other systems. The implicit conversion rules that bite when storing data also bite when asking MySQL to modify data - my favourite example being a fat-fingered `UPDATE` query where a mistyped `=` (as `-`, off by a single key) caused 90% of the rows in the table to be affected, instead of one row, because of implicit string-to-integer conversions. ... against loss: hoo boy. MySQL, out of the box, gives you three approaches to [backups](http://dev.mysql.com/doc/refman/5.5/en/backup-methods.html): * Take "blind" filesystem backups with `tar` or `rsync`. Unless you meticulously lock tables or make the database read-only for the duration, this produces a backup that requires crash recovery before it will be usable, and can produce an inconsistent database. * This can bite quite hard if you use InnoDB, as InnoDB crash recovery takes time proportional to both the number of InnoDB tables and the total size of InnoDB tables, with a large constant. * Dump to SQL with `mysqldump`: slow, relatively large backups, and non-incremental. * Archive binary logs: fragile, complex, over-configurable, and configured badly by default. (Binary logging is also the basis of MySQL's replication system.) If neither of these are sufficient, you're left with purchasing [a backup tool from Oracle](http://dev.mysql.com/doc/refman/5.5/en/glossary.html#glos_mysql_enterprise_backup) or from one of the third-party MySQL vendors. Like many of MySQL's features, the binary logging feature is [too](http://dev.mysql.com/doc/refman/5.5/en/binary-log.html) [configurable](http://dev.mysql.com/doc/refman/5.5/en/replication-options-binary-log.html), while still, somehow, defaulting to modes that are hazardous or surprising: the [default](http://dev.mysql.com/doc/refman/5.5/en/replication-options-binary-log.html#sysvar_binlog_format) [behaviour](http://dev.mysql.com/doc/refman/5.5/en/replication-formats.html) is to log SQL statements, rather than logging their side effects. This has lead to numerous bugs over the years; MySQL (now) makes an effort to make common "non-deterministic" cases such as `NOW()` and `RANDOM()` act deterministically but these have been addressed using ad-hoc solutions. Restoring binary-log-based backups can easily lead to data that differs from the original system, and by the time you've noticed the problem, it's too late to do anything about it. (Seriously. The binary log entries for each statement contain the "current" time on the master and the random seed at the start of the statement, just in case. If your non-deterministic query uses any other function, you're still [fucked by default](http://dev.mysql.com/doc/refman/5.5/en/replication-sbr-rbr.html#replication-sbr-rbr-sbr-disadvantages).) ### Retrieving Data This mostly works as expected. Most of the ways MySQL will screw you happen when you store data, not when you retrieve it. However, there are a few things that implicitly transform stored data before returning it: * `TIMESTAMP` columns (and _only_ `TIMESTAMP` columns) can return apparently-differing values for the same stored value depending on per-connection configuration even during read-only operation. This is done silently and the default behaviour can change as a side effect of non-MySQL configuration changes in the underlying OS. * String-typed columns are transformed for encoding on output if the connection is not using the same encoding as the underlying storage, using the same rules as the transformation on input. * Values that stricter `sql_mode` settings would reject during storage can still be returned during retrieval; it is impossible to predict in advance whether such data exists, since clients are free to set `sql_mode` to any value at any time. ### Efficiency For purely store-and-retrieve applications, MySQL's query planner (which transforms the miniature program contained in each SQL statement into a tree of disk access and data manipulation steps) is sufficient, but only barely. Queries that retrieve data from one table, or from one table and a small number of one-to-maybe-one related tables, produce relatively efficient plans. MySQL, however, offers a number of tuning options that can have dramatic and counterintuitive effects, and the documentation provides very little advice for choosing settings. Tuning relies on the administrator's personal experience, blog articles of varying quality, and consultants. * The MySQL query cache defaults to a non-zero size in some commonly-installed configurations. However, the larger the cache, the slower writes proceed: invalidating cache entries that include the tables modified by a query means considering every entry in the cache. This cache also uses MySQL's LRU implementation, which has its own performance problems during eviction that get worse with larger cache sizes. * Memory-management settings, including `key_buffer_size` and `innodb_buffer_pool_size`, have non-linear relationships with performance. The [standard](http://www.mysqlperformanceblog.com/2006/09/29/what-to-tune-in-mysql-server-after-installation/) [advice](http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/) advises making whichever value you care about more to a large value, but this can be counterproductive if the related data is larger than the pool can hold: MySQL is once again bad at discarding old buffer pages when the buffer is exhausted, leading to dramatic slowdowns when query load reaches a certain point. * This also affects filesystem tuning settings such as `table_open_cache`. * InnoDB, out of the box, comes configured to use one large (and automatically growing) tablespace file for all tables, complicating backups and storage management. This is fine for trivial databases, but MySQL provides no tools (aside from `DROP TABLE` and reloading the data from an SQL dump) for transplanting a table to another tablespace, and provides no tools (aside from a filesystem-level `rm`, and reloading _all_ InnoDB data from an SQL dump) for reclaiming empty space in a tablespace file. * MySQL itself provides very few tools to manage storage; tasks like storing large or infrequently-accessed tables and databases on dedicated filesystems must be done on the filesystem, with MySQL shut down. ## Data Processing Data processing encompasses tasks that require making decisions about data and tasks that derive new data from existing data. This is a huge range of topics: * Deciding (and enforcing) application-specific validity rules. * Summarizing and deriving data. * Providing and maintaining alternate representations and structures. * Hosting complex domain logic near the data it operates on. The further towards data processing tasks applications move, the more their SQL resembles tiny programs sent to the data. MySQL is totally unprepared for programs, and expects SQL to retrieve or modify simple rows. ### Validity Good constraints are like `assert`s: in an ideal world, you can't tell if they work, because your code never violates them. Here in the real world, constraint violations happen for all sorts of reasons, ranging from buggy code to buggy human cognition. A good database gives you more places to describe your expectations and more tools for detecting and preventing surprises. MySQL, on the other hand, can't validate your data for you, beyond simple (and fixed) type constraints: * As with the data you store in it, MySQL feels free to change your table definitions [implicitly and silently](http://dev.mysql.com/doc/refman/5.5/en/silent-column-changes.html). Many of these silent schema changes have important performance and feature-availability implications. * Foreign keys, one of the most widely-used database validity checks, are an engine-specific feature, restricting their availabilty in combination with other engine-specific features. (For example, a table cannot have both foreign key constraints and full-text indexes, as of MySQL 5.5.) * Configurations that violate assumptions about foreign keys, such as a foreign key pointing into a MyISAM or NDB table, do not cause warnings or any other diagnostics. The foreign key is simply discarded. SURPRISE. (MySQL is riddled with these sorts of surprises, and apologists lean very heavily on the "that's documented" excuse for its bad behaviour.) * The MySQL parser recognizes `CHECK` clauses, which allow schema developers to make complex declarative assertions about tuples in the database, but [discards them without warning](http://dev.mysql.com/doc/refman/5.5/en/create-table.html). If you want `CHECK`-like constraints, you must implement them as triggers - but see below... * MySQL's comprehension of the `DEFAULT` clause is, uh, limited: only constants are permitted, except for the [special case](https://dev.mysql.com/doc/refman/5.5/en/timestamp-initialization.html) of at most one `TIMESTAMP` column per table. Who designed this mess? * Furthermore, there's no way to say "no default" and raise an error when an INSERT forgets to provide a value. The default `DEFAULT` is either `NULL` or a zero-like constant (`0`, `''`, and so on). Even for types with no meaningful zero-like values (`DATETIME`). * MySQL has no mechanism for introducing new types, which might otherwise provide a route to enforcing validity. Counting the number of special cases in MySQL's [existing type system](http://dev.mysql.com/doc/refman/5.5/en/data-types.html) illustrates why that's probably unfixable. I hope every client with write access to your data is absolutely perfect, because MySQL _cannot help you_ if you make a mistake. ### Summarizing and Deriving Data SQL databases generally provide features for doing "interesting" things with sets of tuples, and MySQL is no exception. However, MySQL's limitations mean that actually processing data in the database is fraught with wasted money, brains, and time: * Aggregate (`GROUP BY`) queries run up against limits in MySQL's query planner: a query with both `WHERE` and `GROUP BY` clauses can only satisfy one constraint or the other with indexes, unless there's an index that covers all the relevant fields in both clauses, in the right order. (What this order is depends on the complexity of the query and on the distribution of the underlying data, but that's hardly MySQL-specific.) * If you have all three of `WHERE`, `GROUP BY`, and `ORDER BY` in the same query, you're more or less fucked. Good luck designing a single index that satisfies all three. * Even though MySQL allows database administrators to [define normal functions in a procedural SQL dialect](http://dev.mysql.com/doc/refman/5.5/en/create-procedure.html), [custom aggregate functions](http://dev.mysql.com/doc/refman/5.5/en/create-function-udf.html) can only be defined by native plugins. Good thing, too, because procedural SQL in MySQL is its own kind of awful - more on that below. * Subqueries are often convenient and occasionally necessary for expressing multi-step transformations on some underlying data. MySQL's query planner has only one strategy for optimizing them: evaluate the innermost query as written, into an in-memory table, then use a nested loop to satisfy joins or `IN` clauses. For large subquery results or interestingly nested subqueries, this is absurdly slow. * MySQL's query planner can't fold constraints from outer queries into subqueries. * The generated in-memory table never has any indexes, ever, even when appropriate indexes are "obvious" from the surrounding query; you cannot even specify them. * These limitations also affect views, which are evaluated as if they were subqueries. In combination with the lack of constraint folding in the planner, this makes filtering or aggregating over large views completely impractical. * MySQL lacks [common table expressions](http://www.postgresql.org/docs/9.2/static/queries-with.html). Even if subquery efficiency problems get fixed, the inability to give meaningful names to subqueries makes them hard to read and comprehend. * I hope you like `CREATE TEMPORARY TABLE AS SELECT`, because that's your only real alternative. * [Window functions](http://en.wikipedia.org/wiki/Select_(SQL)#Window_function) do not exist at all in MySQL. This complicates many kinds of analysis, including time series analyses and ranking analyses. * Specific cases (for example, assigning rank numbers to rows) can be implemented using [server-side variables and side effects during `SELECT`](http://stackoverflow.com/questions/6473800/assigning-row-rank-numbers). What? Good luck understanding that code in six months.