I’m working on merging a couple of pretty big MySQL databases.

Each database consists of just three tables, but each of those tables has at least 9 million rows.

Because I need to maintain and update a number of data associations, it was necessary to write a custom script to perform the merge. (I chose to use Perl and DBI.)

At one point in my script, I need to do a bunch of “insert this row from database B into database A if it doesn’t exist in database A” type operations. Doing a separate lookup for each record would cost roughly 9 million queries.

At first, I just went ahead and did the insert, allowing the single insert to fail with a unique key error if the record already existed. This caused a bunch of DBD errors to fill the screen, none of which really concerned me, because it didn’t have any bearing on the result. Still, all of those messages printed to STDERR could bury a more serious error, so I looked into handling it another way.

I found this post that recommends using either INSERT IGNORE or ON DUPLICATE KEY to handle these queries. After some investigation, I found that while ON DUPLICATE KEY is a lot faster than INSERT IGNORE, just allowing the thing to fail is the fastest option.

First, some simplified pseudo-code to give you an idea of what I’m doing:

FOR EACH RECORD IN A.TABLE1
   FOR EACH RECORD IN A.TABLE2
      IF THAT RECORD NOT EXISTS IN B.TABLE2,
      DO AN INSERT INTO B.TABLE2

I wanted to avoid doing a separate SQL query at step 3, as that query would have to run 9 million+ times. Since I have a unique key set on the columns I’m interested in, any duplicate rows will not be inserted, thus maintaining my data integrity.

Here are the speed benchmarks I ran for my particular script with each of the three options I tried. The numbers are in terms of iterations of my particular loop (which does a lot more than the pseudo-code above) so the numbers should only be considered relative to each other.

Option 1: Let the query fail with a Unique Key error.

INSERT INTO A (COL1, COL2) VALUES (val1, val2);

Loop iterations per second: 24.5 (avg)

Option 2: INSERT … IGNORE

INSERT IGNORE INTO A (COL1, COL2) VALUES (val1, val2);

Loop iterations per second: 13.3 (avg)

Option 3: INSERT … ON DUPLICATE KEY

INSERT INTO A (COL1, COL2) VALUES (val1, val2) ON DUPLICATE KEY UPDATE id=id;

Loop iterations per second: 18.2 (avg)

As you can see, the fastest of the three options is letting the query fail, followed by ON DUPLICATE KEY as the second fastest.

I’ll leave it to the MySQL experts to explain why this is the case. I have some guesses, but I can’t say for sure.

But I do know one thing: On my setup, for my purposes, letting those queries fail is the quickest option.