Guide To Using mysqlidxchk
The purpose of mysqlidxchk is to check MySQL databases and tables for indexes
which are not being used. MySQL does not currently have an intrinsic
method or metric to track which indexes are or are not used. Therefore,
a script like mysqlidxchk is needed to discover this information easily.
mysqlidxchk can be used in the scope of an entire schema, or in the scope
of a select number of databases. Either way, mysqlidxchk requires one or more MySQL
slow or general logs (or a "raw" log) in order to compare the indexes used by
the queries in the logs to all available indexes in the given scope. From this comparison,
mysqlidxchk reports which indexes are not used (and, with the
--show-used command line option, which indexes are used).
Therefore, the extent to which mysqlidxchk can accurately report which
indexes are not used is directly related to the extent to which the
queries in the given logs are representative of how the databases are used.
In this guide we will consider a number of topics. First, we'll consider some
practical uses for mysqlidxchk. Then, we'll look more closely at how mysqlidxchk
works. Finally, I will present a short demonstration of mysqlidxchk in action.
Practical Uses For mysqlidxchk
The idea for mysqlidxchk was generously shared with me by Peter Zaitsev, who
was previously employed as Manager of High Performance Group at MySQL AB but
is now an independent
MySQL consultant.
Hence, mysqlidxchk's most obvious practical use is for MySQL consultants who wish
to streamline a database schema. Among a multitude of other tasks, a consultant
will eventually want to remove any unnecessary and unused indexes from the
database schema. With a general log and the --databases command line option,
a consultant can limit mysqlidxchk to only the databases that they are interested in.
Developers can use mysqlidxchk to check for unused indexes—indexes which
they had created earlier but have since abandoned or changed. In the haste of
development sometimes we (or at least I) forget about those "other" indexes.
As developers, we already know the queries which are ran against the database,
therefore we don't really need a MySQL general or slow log. Instead, we can
write our own SQL statements to a "raw" log and use mysqlidxchk with that to check
for forgotten indexes. A "raw" log is simply a text file that contains semi-colon
new-line (;\n) terminated MySQL statements.
In certain cases, system/database administrators can also use mysqlidxchk
to check for unused indexes. Administrators would probably use a combination of
slow and general logs and check every database in the entire schema. For
MySQL servers with many databases and tables, this would be practically impossible
by hand, but mysqlidxchk can do it in a few seconds.
Finally, hackers can use mysqlidxchk to expose the unused indexes of some
third-party application. For example, I often times talk with people who are
fixing and optimizing the database schema and queries of a third-party app.
With mysqlidxchk these hackers can see what extraneous indexes
the third-party app has left lurking in the schema and remove them.
How mysqlidxchk Works
At over 900 lines, mysqlidxchk is not a small script, yet it seems to do
a relatively simple task: find unused indexes. After I began writing mysqlidxchk,
I realized that it had to do a lot "background" work to insure that it
was truly helpful. Otherwise, due to the lack of certain features and functionalities
in MySQL, mysqlidxchk would fail far too easily.
Log Parsing & USE
The first and most important task is log parsing. The log parsing routines in
mysqlidxchk were taken from mysqlsla v1.4 and modified. The first problem to
overcome in log parsing was retaining
USE statements
for each query. As we already know, you cannot
EXPLAIN
a query until you've selected the appropriate database.
For
general logs
it is easy to find and retain the USE statements for every query because every
connection either connects with a database specified or issues an "Init DB" command.
Since all connections are uniquely numbered, mysqlidxchk keeps track
of every connections database usage and saves that information with the connections
queries. In short: for general logs mysqlidxchk always knows which queries use
which databases.
Slow logs
are a different. Slow logs do not always contain USE statements and they do not
contain any other information that would allow mysqlidxchk to easily discover
a query's database. Therefore, for slow logs mysqlidxchk must often rely on
"database discovery" which is discussed below. Alternatively, one could manually
add USE statements to a slow log if necessary.
The final kind of "log" file that mysqlidxchk can parse is what I call a raw log.
A raw log is simply a text file containing SQL statements that are semi-colon new-line
terminated. A single SQL statements can span multiple lines, but it must end with a
";\n". In other words, the next SQL statement has to start on the next line down.
For raw logs, one must add the appropriate USE statements
before the queries
that are to be used with that database. For example, here is a short raw log:
USE db_foo;
SELECT * FROM table WHERE col = 1;
USE db_bar;
UPDATE table SET col = 'hello' WHERE col = 'goodbye';
In this example, the SELECT statement is used with database db_foo and the UPDATE
is used with database db_bar. A USE statement remains in effect for all queries
below it until another USE statement is found.
Database Discovery
In cases where mysqlidxchk has queries with no database (no corresponding
USE statement), it will try to discover which database the query belongs to.
This process of database discovery is actually quite successful, except in
one case. First, lets look at how mysqlidxchk does database discovery.
One of the first things that mysqlidxchk does when ran is read the entire
information schema
(a.k.a. metadata, data dictionary, system catalog). Actually, the current
version of mysqlidxchk does not read MySQL's INFORMATION_SCHEMA database;
it uses
SHOW statements
to discover every index in every table in every database (unless the --databases
option is used, then it only discovers the given databases).
Knowing every table in every database allows mysqlidxchk to find a query's
database by comparing the tables that the query uses to the tables in each
database. For example, suppose you have a database called "foo" with tables
"t1, t2, t3". Now suppose mysqlidxchk encounters a query: "SELECT * FROM t1,
t2 WHERE t1.col = 'happy';". mysqlidxchk will see that every table in the
query is found in database foo. Therefore, it assigns the query to USE foo.
As previously mentioned there is one case in which database discovery fails.
You can probably already guess it. The case is when there are two or more databases
with identically named tables. A case like this can happen on a shared server
with, for example, many different installations of
WordPress. In such a case, the result of
mysqlidxchk's database discovery is undefined. mysqlidxchk will probably pick
the first matching database that it finds. There is no clean solution to this
problem that I am aware of.
In general though, database discovery works well. It even works to find
"missing" databases. If mysqlidxchk has a query which uses a nonexistent
database, it will employ database discovery to find the query's missing
database. Conversely, if mysqlidxchk cannot find the missing database, it
will discard all queries which use the nonexistent database.
See the
mysqlidxchk Documentation for two
command line options related to database discovery: --no-db-discovery
and --no-discovery-report.
UPDATE Statements
Unless the --ignore-update command line is used, mysqlidxchk converts
UPDATE statements into SELECT statements in order to EXPLAIN them. The
method by which mysqlidxchk does this is simple: it converts
"UPDATE table_reference SET column_reference WHERE etc." to
"SELECT * FROM table_reference WHERE etc." Database discovery applies
to UPDATE statements, too.
Showing Used Index
The --show-used option causes mysqlidxchk to also show which indexes are
used. At present, that is about all it does except for one more thing. The
indexes that are marked as used will have a number in parenthesis beside
them. This number is the number of
unique (i.e. abstracted) queries
which use that index. I am open to
suggestions about
what else you would like mysqlidxchk to say about used indexes?
A Short Demonstration of mysqlidxchk
Here follows two mock databases and some queries I created to demonstrate
mysqlidxchk in action. First we have the databases: hacking and hacking2.
hacking contains tables a, b, c. hacking2 contains tables x, y, z.
The first two tables of each database (a, b and x, y) contain one column
which is indexed. The last tables contain three columns which are indexed
in various ways. Here are the indexes for all tables:
mysql> SHOW INDEX FROM a; SHOW INDEX FROM b; SHOW INDEX FROM c;
+-------+------------+-----------+--------------+-------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name |
+-------+------------+-----------+--------------+-------------+
| a | 1 | idx_Ta_C1 | 1 | col_1 |
+-------+------------+-----------+--------------+-------------+
+-------+------------+-----------+--------------+-------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name |
+-------+------------+-----------+--------------+-------------+
| b | 1 | idx_Tb_C1 | 1 | col_1 |
+-------+------------+-----------+--------------+-------------+
+-------+------------+-------------+--------------+-------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name |
+-------+------------+-------------+--------------+-------------+
| c | 1 | idx_Tc_C1 | 1 | col_1 |
| c | 1 | idx_Tc_C2 | 1 | col_2 |
| c | 1 | idx_Tc_C1_2 | 1 | col_1 |
| c | 1 | idx_Tc_C1_2 | 2 | col_2 |
| c | 1 | idx_Tc_C2_1 | 1 | col_2 |
| c | 1 | idx_Tc_C2_1 | 2 | col_1 |
+-------+------------+-------------+--------------+-------------+
mysql> SHOW INDEX FROM x; SHOW INDEX FROM y; SHOW INDEX FROM z;
+-------+------------+-----------+--------------+-------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name |
+-------+------------+-----------+--------------+-------------+
| x | 1 | idx_Tx_C1 | 1 | col_1 |
+-------+------------+-----------+--------------+-------------+
+-------+------------+-----------+--------------+-------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name |
+-------+------------+-----------+--------------+-------------+
| y | 1 | idx_Ty_C1 | 1 | col_1 |
+-------+------------+-----------+--------------+-------------+
+-------+------------+-------------+--------------+-------------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name |
+-------+------------+-------------+--------------+-------------+
| z | 1 | idx_Tz_C3_2 | 1 | col_3 |
| z | 1 | idx_Tz_C3_2 | 2 | col_2 |
| z | 1 | idx_Tz_C1_2 | 1 | col_1 |
| z | 1 | idx_Tz_C1_2 | 2 | col_2 |
| z | 1 | idx_Tz_C1 | 1 | col_1 |
+-------+------------+-------------+--------------+-------------+
Now, here are some queries from a raw log for these databases and tables:
USE hacking;
SELECT * FROM c;
SELECT * FROM a AS apple, b AS banana
WHERE apple.col_1 = banana.col_1;
USE hacking2;
SELECT * FROM x, y, z LIMIT 2;
SELECT z.col_3 FROM x, z WHERE z.col_1 = x.col_1;
SELECT bar.col_3 FROM x foo, z bar WHERE bar.col_1 = foo.col_1;
SELECT x.col_1, y.col_1, z.* FROM x, y, z
WHERE x.col_1 = y.col_1 and x.col_1 = z.col_1;
USE hacking;
SELECT cat.*, a.*, boy.*
FROM
c AS cat,
b boy,
a
WHERE
a.col_1 = cat.col_2 AND cat.col_1 NOT IN (2,4, 6, 8);
USE hacking;
UPDATE a SET col_1 = 2 WHERE col_1 IN (2, 4, 6);
UPDATE b AS boy SET col_1 = NULL WHERE 1;
USE hacking2;
UPDATE LOW_PRIORITY z zoot SET col_1 + 1, col_2 = 3 WHERE col_3 IS NOT NULL LIMIT 2;
UPDATE IGNORE x, y AS yak, z zar SET col_1 = 0 WHERE zar.col_1 IS NULL;
After inserting some random data into the tables I ran mysqlidxchk:
# ./mysqlidxchk --raw test_raw --databases hacking,hacking2 --show-used
Reading raw log 'test_raw'.
11 total valid queries, 11 unique.
Only using databases: hacking hacking2
Database: hacking
Table: a
Index: idx_Ta_C1 used (3)
Table: b
Index: idx_Tb_C1 used (3)
Table: c
Index: idx_Tc_C1 NOT used
Index: idx_Tc_C1_2 NOT used
Index: idx_Tc_C2 NOT used
Index: idx_Tc_C2_1 NOT used
Database: hacking2
Table: foo
Index: idx_Tx_C1 used (1)
Table: x
Index: idx_Tx_C1 used (4)
Table: y
Index: idx_Ty_C1 used (3)
Table: z
Index: idx_Tz_C1 NOT used
Index: idx_Tz_C1_2 used (1)
Index: idx_Tz_C3_2 used (1)
And that is all there is to it. If this were a real database, I would probably
drop all those unused indexes.
For now mysqlidxchk is very simple with just one report. However, its one
report is quite helpful in streamlining one more crucial part of your databases
and tables: indexes.
(Doc rev: Mar 31 2007)