Select only one row from a table under many parallel connections

I'm looking for a way to select one table row explicitly for one thread. I've written a crawler that works with about 50 parallel processes. Every process has to take one row out of a table and process it.

CREATE TABLE `crawler_queue` (
 `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
 `url` text NOT NULL,
 `class_id` tinyint(3) unsigned NOT NULL,
 `server_id` tinyint(3) unsigned NOT NULL,
 `proc_id` mediumint(8) unsigned NOT NULL,
 `prio` tinyint(3) unsigned NOT NULL,
 `inserted` int(10) unsigned NOT NULL,
 PRIMARY KEY (`id`),
 KEY `proc_id` (`proc_id`),
 KEY `crawler` (`class_id`,`prio`,`proc_id`)
) ENGINE=InnoDB;

Now my processes do the following:

  • start DB transaction
  • do a select like SELECT * FROM crawler_queue WHERE class_id=2 AND prio=20 AND proc_id=0 ORDER BY id LIMIT 1 FOR UPDATE
  • then update this row with UPDATE crawler_queue SET server_id=1,proc_id=1376 WHERE id=23892
  • commit the transaction
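
Put together, one worker's claim step looks like this (using the example values from above):

```sql
START TRANSACTION;

-- find one free row and lock it
SELECT * FROM crawler_queue
 WHERE class_id = 2 AND prio = 20 AND proc_id = 0
 ORDER BY id
 LIMIT 1 FOR UPDATE;

-- mark it as taken (23892 is the id returned by the SELECT above)
UPDATE crawler_queue
   SET server_id = 1, proc_id = 1376
 WHERE id = 23892;

COMMIT;
```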

This should ensure that no other process can grab a row that is already being processed. Doing an EXPLAIN on the select shows

id  select_type  table          type  possible_keys    key      key_len  ref    rows    Extra
1   SIMPLE       crawler_queue  ref   proc_id,crawler  proc_id  3        const  617609  Using where

But the parallelism seems to be too high, because every 5 minutes or so I see two types of errors/warnings in my log:

mysqli::query(): (HY000/1205): Lock wait timeout exceeded; try restarting transaction (in /var/www/db.php line 81)

mysqli::query(): (40001/1213): Deadlock found when trying to get lock; try restarting transaction (in /var/www/db.php line 81)

My question is: can anybody point me in the right direction to minimize these locking problems? (In production, the parallelism would be 3-4 times higher than now, so I assume there would be many more locking problems.)

I modified the SELECT to use the crawler index via the hint USE INDEX(crawler). My remaining problem is lock wait timeouts (the deadlocks disappeared).

EXPLAIN with USE INDEX() now shows (the number of rows is higher because the table contains more data now):

id  select_type  table          type  possible_keys    key      key_len  ref                rows     Extra
1   SIMPLE       crawler_queue  ref   proc_id,crawler  crawler  5        const,const,const  5472426  Using where

asked Dec 23 '12 at 20:12

3 Answers

Your EXPLAIN report shows that you're using only the single-column index proc_id, and the query has to examine over 600K rows. It would probably be better if the optimizer chose the crawler index.

InnoDB may be locking all 600K rows, not just the rows that match the full condition in your WHERE clause. InnoDB locks all the examined rows to make sure concurrent changes don't get written to the binlog in the wrong order.

The solution is to use an index to narrow the range of examined rows. This will probably help you not only to find the rows more quickly, but also to avoid locking large ranges of rows. The crawler index should help here, but it's not immediately clear why it's not using that index.

You may need to run ANALYZE TABLE to update InnoDB's table statistics, so the optimizer knows enough about the crawler index to choose it in the query plan. ANALYZE TABLE is an inexpensive operation.
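
For example:

```sql
ANALYZE TABLE crawler_queue;
```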

The other option is to use an index hint:

SELECT * FROM crawler_queue USE INDEX(crawler) ...

This tells the optimizer to use that index and not to consider other indexes for this query. I prefer to avoid index hints, because the optimizer is usually able to make good decisions on its own, and hard-coding a hint means I may be forcing the optimizer not to consider an index I create in the future, which it would otherwise choose.

With more explanation, it's now clear you're using your RDBMS as a FIFO. This is not an efficient use of an RDBMS. There are message queue technologies for this purpose.


answered May 23 '17 at 13:05

Hey Bill, that's what I've already done (sorry for not updating my question, will give +1 though). But what's strange is that EXPLAIN has sometimes shown the usage of crawler instead of proc_id. For now I force usage of the crawler index. I'll give the ANALYZE TABLE command a try, too. Thanks - rabudo

Be sure to read the rows field in the EXPLAIN output. With the compound index, the number of rows examined should be lower, I hope. - Bill Karwin

No, it doesn't (see above) - rabudo

What about forcing usage of the primary key? This table works like a FIFO, so each process looks for the first free row. In most cases the first 400-500 rows are already being processed by other threads, so a USE INDEX(PRIMARY) should examine only some hundred rows, shouldn't it? - rabudo

From what I can tell, the problem you're facing is that two threads are vying for the same row in the table and they both can't have it. But there isn't any elegant way for the database to say "no, you can't have that one, find another row", and thus you get errors. This is called resource contention.

When you're doing highly parallel work like this one of the easiest ways to reduce contention-based problems is to completely eliminate the contention by inventing a way for all the threads to know which rows they're supposed to work on ahead of time. Then they can lock without having to contend for the resources and your database doesn't have to resolve the contention.

How best to do this? Usually people pick some kind of thread-id scheme and use modulo arithmetic to determine which threads get which rows. If you have 10 threads, then thread 0 gets rows 0, 10, 20, 30, etc., and thread 1 gets 1, 11, 21, 31, etc.

In general if you have NUM_THREADS then each of your threads would pick the ids which are THREAD_ID + i*NUM_THREADS from the database and work on those.
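
As a sketch (with hypothetical values NUM_THREADS = 50 and THREAD_ID = 7, not taken from the question), each worker would restrict its query like this:

```sql
-- Hypothetical: this worker is thread 7 of 50, so it only ever
-- sees rows whose id is pre-assigned to it.
SELECT *
  FROM crawler_queue
 WHERE class_id = 2 AND prio = 20 AND proc_id = 0
   AND id % 50 = 7
 ORDER BY id
 LIMIT 1 FOR UPDATE;
```

The modulo condition itself can't use an index, but since no two threads ever compete for the same row, the FOR UPDATE locks no longer collide.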

We have introduced a problem in that threads may stall or die, and you could end up with rows in the database which never get touched. There are several solutions to that problem, one of which is to run a "cleanup" once most/all of your threads have finished where all the threads grab piecemeal whatever rows they can and crawl them until there are no un-crawled URLs left. You could get more sophisticated and have a few cleanup threads constantly running, or have each thread occasionally perform cleanup duties, etc.

answered Dec 29 '12 at 19:12

A better solution would be to do the update and skip the select entirely. Then you can use LAST_INSERT_ID() to pick up the updated item. This should allow you to skip locking completely while performing the update at the same time. Once the record is updated you can start processing it, since it will never be selected again by the same query, given that the initial conditions no longer all match.

I think this should help you alleviate all the problems related to locking and should allow you to run as many processes as you want in parallel.

PS: Just to clarify, I am talking about UPDATE ... LIMIT 1 to make sure you only update one row.

EDIT: The solution pointed out in the comment below is the correct one.
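
The pattern being referred to is presumably MySQL's `SET col = LAST_INSERT_ID(col)` trick (an assumption here, since the original link is missing). A sketch:

```sql
-- Hypothetical sketch of the claim-then-fetch pattern (not from the original post):
-- the UPDATE both claims one free row and records its id for this connection.
UPDATE crawler_queue
   SET server_id = 1,
       proc_id = 1376,            -- this worker's id
       id = LAST_INSERT_ID(id)    -- id is unchanged; its value is remembered
 WHERE class_id = 2 AND prio = 20 AND proc_id = 0
 ORDER BY id
 LIMIT 1;

-- Per-connection: returns the id of the row just claimed.
-- Check ROW_COUNT() first to see whether any row matched at all.
SELECT LAST_INSERT_ID();
```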

answered May 23 '17 at 13:05

Good idea, but LAST_INSERT_ID() will return a value only if you INSERT data or an UPDATE increments an AUTO_INCREMENT column. EDIT: I'll give it a try - rabudo

For some reason I got a LAST_INSERT_ID value when I tested, but it tricked me (it looked like the correct one, but it was not). I believe the solution described in that SO question is the way to go. I will update my answer as well. - xruido
