I'm looking for a way to select one table row exclusively for one thread. I've written a crawler that works with about 50 parallel processes. Every process has to take one row out of a table and process it.
CREATE TABLE `crawler_queue` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `url` text NOT NULL,
  `class_id` tinyint(3) unsigned NOT NULL,
  `server_id` tinyint(3) unsigned NOT NULL,
  `proc_id` mediumint(8) unsigned NOT NULL,
  `prio` tinyint(3) unsigned NOT NULL,
  `inserted` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  KEY `proc_id` (`proc_id`),
  KEY `app_id` (`app_id`),
  KEY `crawler` (`class_id`,`prio`,`proc_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8
Now my processes do the following:
- start DB transaction
- do a select like
SELECT * FROM crawler_queue WHERE class_id=2 AND prio=20 AND proc_id=0 ORDER BY id LIMIT 1 FOR UPDATE
- then update this row with
UPDATE crawler_queue SET server_id=1,proc_id=1376 WHERE id=23892
- commit the transaction
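Written out as the actual statements each process issues (values taken from the examples above), the sequence is:

```sql
START TRANSACTION;

-- Lock one unclaimed row so no other process can take it
-- until this transaction commits.
SELECT * FROM crawler_queue
 WHERE class_id = 2 AND prio = 20 AND proc_id = 0
 ORDER BY id
 LIMIT 1
 FOR UPDATE;

-- Claim the row found above (id 23892 in the example).
UPDATE crawler_queue SET server_id = 1, proc_id = 1376 WHERE id = 23892;

COMMIT;
```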
This should ensure that no other process can grab a row that is already being processed. Doing an EXPLAIN on the select shows:
id  select_type  table          type  possible_keys    key      key_len  ref    rows    Extra
1   SIMPLE       crawler_queue  ref   proc_id,crawler  proc_id  3        const  617609  Using where
But the level of parallelism seems to be too high, because every 5 minutes or so I see two types of errors/warnings in my log:
mysqli::query(): (HY000/1205): Lock wait timeout exceeded; try restarting transaction (in /var/www/db.php line 81)
mysqli::query(): (40001/1213): Deadlock found when trying to get lock; try restarting transaction (in /var/www/db.php line 81)
My question is: can anybody point me in the right direction to minimize these locking problems? (In production, the parallelism would be 3-4 times higher than now, so I assume there would be many more locking problems.)
Update: I forced the SELECT to use the crawler index via the hint USE INDEX(crawler). My remaining problem is now the lock wait timeouts (the deadlocks disappeared).
EXPLAIN with USE INDEX(crawler) now shows (the row count is higher because the table contains more data now):

id  select_type  table          type  possible_keys    key      key_len  ref                rows     Extra
1   SIMPLE       crawler_queue  ref   proc_id,crawler  crawler  5        const,const,const  5472426  Using where
asked Dec 23 '12 at 20:12
Your EXPLAIN report shows that you're using only the single-column index proc_id, and the query has to examine over 600K rows. It would probably be better if the optimizer chose the crawler index.
InnoDB may be locking all 600K rows, not just the rows that match the full condition in your WHERE clause. InnoDB locks all the examined rows to make sure concurrent changes don't get written to the binlog in the wrong order.
The solution is to use an index to narrow the range of examined rows. This will probably help you not only to find the rows more quickly, but also to avoid locking large ranges of rows. The crawler index should help here, but it's not immediately clear why it's not using that index.
You might have to run ANALYZE TABLE to update InnoDB's table statistics so that the optimizer knows about the crawler index before it builds the query plan. ANALYZE TABLE is an inexpensive operation.
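For reference, the statistics refresh for this table is a single statement:

```sql
ANALYZE TABLE crawler_queue;
```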
The other option is to use an index hint:
SELECT * FROM crawler_queue USE INDEX(crawler) ...
This tells the optimizer to use that index and not to consider any other indexes for this query. I prefer to avoid index hints, because the optimizer is usually able to make good decisions on its own, and using a hint in code means I may be forcing the optimizer not to consider an index I create in the future, which it would otherwise choose.
With more explanation, it's now clear you're using your RDBMS as a FIFO. This is not an efficient use of an RDBMS. There are message queue technologies for this purpose.
From what I can tell, the problem you're facing is that two threads are vying for the same row in the table and they both can't have it. But there isn't any elegant way for the database to say "no, you can't have that one, find another row", and thus you get errors. This is called resource contention.
When you're doing highly parallel work like this one of the easiest ways to reduce contention-based problems is to completely eliminate the contention by inventing a way for all the threads to know which rows they're supposed to work on ahead of time. Then they can lock without having to contend for the resources and your database doesn't have to resolve the contention.
How best to do this? Usually people pick some kind of thread-id scheme and use modulo arithmetic to determine which threads get which rows. If you have 10 threads, then thread 0 gets rows 0, 10, 20, 30, etc., and thread 1 gets rows 1, 11, 21, 31, etc.
In general if you have NUM_THREADS then each of your threads would pick the ids which are THREAD_ID + i*NUM_THREADS from the database and work on those.
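As a sketch, with hypothetical values NUM_THREADS = 50 and THREAD_ID = 3 substituted per process (and the filter values from the question), each worker's claim query could look like:

```sql
-- Each worker only ever touches ids congruent to its own THREAD_ID,
-- so no two workers compete for the same row.
SELECT *
  FROM crawler_queue
 WHERE class_id = 2
   AND prio = 20
   AND proc_id = 0
   AND MOD(id, 50) = 3   -- MOD(id, NUM_THREADS) = THREAD_ID
 ORDER BY id
 LIMIT 1;
```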
We have introduced a problem in that threads may stall or die, and you could end up with rows in the database which never get touched. There are several solutions to that problem, one of which is to run a "cleanup" once most/all of your threads have finished where all the threads grab piecemeal whatever rows they can and crawl them until there are no un-crawled URLs left. You could get more sophisticated and have a few cleanup threads constantly running, or have each thread occasionally perform cleanup duties, etc.
A better solution would be to do the update first and skip the select entirely. Then you can use LAST_INSERT_ID() to pick up the updated item. This should allow you to skip locking completely while performing the update at the same time. Once the record is updated, you can start processing it, since it will never be selected again by the exact same query, because not all of the initial conditions match anymore.
I think this should help you alleviate all the problems related to locking and should allow you to run as many processes as you want in parallel.
PS: Just to clarify, I am talking about UPDATE ... LIMIT 1 to make sure you only update one row.
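One way to sketch this claim-by-update pattern in MySQL is the LAST_INSERT_ID(expr) trick (the column values are taken from the question's examples; treat this as an illustration, not a drop-in):

```sql
-- Atomically claim one unprocessed row and remember its id in the
-- session's LAST_INSERT_ID, without an explicit SELECT ... FOR UPDATE.
UPDATE crawler_queue
   SET server_id = 1,
       proc_id = 1376,
       id = LAST_INSERT_ID(id)   -- no-op assignment that captures the id
 WHERE class_id = 2
   AND prio = 20
   AND proc_id = 0
 ORDER BY id
 LIMIT 1;

-- Only meaningful when the UPDATE matched a row; check ROW_COUNT() first.
SELECT LAST_INSERT_ID();
```

Because the UPDATE itself changes proc_id, a row claimed this way no longer matches the WHERE clause, so another process running the same statement picks the next row instead.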