Terminating recalcitrant Redshift threads


Step 1: There were two nasty queries active that wouldn’t cancel/terminate:

  pid  | seconds |     user      |                                               querytxt
-------+---------+---------------+------------------------------------------------------------------------------------------------------
  9315 |   22223 | rsrootuser    | /*MTLN-1.37.4 (build 264)*/                                                                         +
       |         |               | DROP TABLE IF EXISTS "analysis_ready"."airdna_airbnb_properties" CASCADE
 23925 |   14125 | rsrootuser    | /*MTLN-1.37.4 (build 264)*/                                                                         +
       |         |               |  select                                                                                             +
       |         |               |     concat('airbnb', p.property_id::varchar) propertymap_id,                                        +
       |         |               |
 69872 |     691 | danieljohnson | select date, book::decimal / (book+avail)::decimal occ, (book+avail) numunits from\r                +
       |         |               | (select \r                                                                                          +
       |         |               |   a.da
 71306 |     419 | looker        | -- Looker Query Context '{"user_id":855,"history_id":21653565,"instance_slug":"d81b23982577fd88f0724
 69477 |       2 | rsrootuser    | COPY "etl_testing"."e5alSBc9" ("appId","appName","browser","created","deviceType","emailCampaignId",
 72968 |       1 | rsrootuser    | /*MTLN-1.37.4 (build 264)*/                                                                         +
       |         |               | COPY "high_risk"."units_fraud_src" ("unit_id", "unit_code", "date_first_
(6 rows)

Using `CANCEL`, `ABORT` nor `pg_terminate_backend` would terminate the query.

Step 2: My assumption was that there must be a lock, so I checked:

warehouse=# select * from stv_locks;
 table_id |        last_commit         |        last_update         | lock_owner | lock_owner_pid |    lock_owner_start_ts     |  lock_owner_end_ts  |      lock_status
----------+----------------------------+----------------------------+------------+----------------+----------------------------+---------------------+------------------------
   197152 | 2016-04-17 16:53:24.359937 | 2016-04-17 17:02:29.989843 |  101405934 |          69518 | 2016-04-17 17:02:29.598797 | 2000-01-01 00:00:00 | Holding write lock
   197166 | 2016-04-17 16:53:24.359937 | 2016-04-17 17:02:31.52808  |  101405934 |          69518 | 2016-04-17 17:02:29.598797 | 2000-01-01 00:00:00 | Holding write lock
   197166 | 2016-04-17 16:53:24.359937 | 2016-04-17 17:02:31.531325 |  101405934 |          69518 | 2016-04-17 17:02:29.598797 | 2000-01-01 00:00:00 | Holding  insert lock
 15195795 | 2016-04-17 16:45:42.988043 | 2016-04-17 17:02:31.401243 |  101405969 |          69473 | 2016-04-17 17:02:30.94705  | 2000-01-01 00:00:00 | Holding write lock
 15195795 | 2016-04-17 16:45:42.988043 | 2016-04-17 17:02:31.410358 |  101405969 |          69473 | 2016-04-17 17:02:30.94705  | 2000-01-01 00:00:00 | Holding  insert lock
 15445960 | 2016-04-17 10:38:19.712661 | 2016-04-17 17:02:30.726889 |  101405954 |          69501 | 2016-04-17 17:02:30.352144 | 2000-01-01 00:00:00 | Holding write lock
 15445960 | 2016-04-17 10:38:19.712661 | 2016-04-17 17:02:30.737587 |  101405954 |          69501 | 2016-04-17 17:02:30.352144 | 2000-01-01 00:00:00 | Holding  insert lock

Nope.

Step 3: I remembered reading that _some_ low-level locks only appeared in `pg_locks`, not in `stv_locks`.

warehouse=# select * from pg_locks;
 relation | database | transaction |  pid   |         mode          | granted
----------+----------+-------------+--------+-----------------------+---------
...some stuff deleted...
   102052 |   100153 |             |  75617 | AccessShareLock       | t
   629217 |   100153 |             |  75609 | AccessShareLock       | t
    16688 |   100153 |             |   9315 | RowExclusiveLock      | t
 15439762 |   100153 |             |  75593 | AccessShareLock       | t
          |          |   101406858 |  75613 | ExclusiveLock         | t
   629217 |   100153 |             |  75608 | AccessShareLock       | t
          |          |   101406889 |  75619 | ExclusiveLock         | t

Look! THAR SHE BLOWS. :whale:

Step 4: Narrowing it down…

warehouse=# select * from pg_locks where relation=(select oid from pg_class where relname = 'airdna_airbnb_properties');
 relation | database | transaction |  pid   |        mode         | granted
----------+----------+-------------+--------+---------------------+---------
 15345523 |   100153 |             |  23925 | AccessShareLock     | f
 15345523 |   100153 |             |  74134 | AccessShareLock     | f
 15345523 |   100153 |             |  75743 | AccessShareLock     | f
 15345523 |   100153 |             |   9315 | AccessExclusiveLock | f
 15345523 |   100153 |             |   1160 | AccessShareLock     | t
 15345523 |   100153 |             | 130419 | AccessShareLock     | t
 15345523 |   100153 |             |  69872 | AccessShareLock     | f
(7 rows)]]>

So–in fact, the lock that pid 9315 is in the queue, but it’s not granted because two other pids have a lock. NEITHER OF THESE SHOW UP IN A PROCESSLIST.

Step 5: Well, sez I, I cannot tell what pids 1160 nor 130419 are. I can either kill them and hope for the best, or restart the whole damned Redshift instance. Might as well try:

 

warehouse=# cancel 130419;
CANCEL
warehouse=# cancel 1160;
CANCEL

NOW pid 9315 has an actual lock:

warehouse=# select * from pg_locks where relation=(select oid from pg_class where relname = 'airdna_airbnb_properties');
 relation | database | transaction |  pid  |        mode         | granted
----------+----------+-------------+-------+---------------------+---------
 15345523 |   100153 |             | 23925 | AccessShareLock     | f
 15345523 |   100153 |             | 74134 | AccessShareLock     | f
 15345523 |   100153 |             | 75743 | AccessShareLock     | f
 15345523 |   100153 |             |  9315 | AccessExclusiveLock | t
 15345523 |   100153 |             | 64420 | AccessShareLock     | f
 15345523 |   100153 |             | 69872 | AccessShareLock     | f
(6 rows)

Step 6: :skull_and_crossbones: 9315 again:

 

warehouse=# select pid, duration/1000000 as seconds, trim(user_name) as user,substring(query,1,100) as querytxt from stv_recents where status = 'Running' and seconds >=1 order by seconds desc;
  pid  | seconds |       user       |                                               querytxt
-------+---------+------------------+------------------------------------------------------------------------------------------------------
 76659 |      62 | rsrootuser       | /*MTLN-1.37.4 (build 264)*/                                                                         +
       |         |                  | COPY "app1"."stg_ha_agent_performance" ("agent", "user_id", "inquiry_2
 69497 |      16 | rsrootuser       | UPDATE "marketo"."Leads" SET "id" = s."id", "firstName" = s."firstName", "middleName" = s."middleNam
 64420 |      13 | cavejohnsoniii   | select manager, city, bedrooms, date, avg(price_usd) price, destid, name, occ                       +
       |         |                  |  from                                                                                               +
       |         |                  |  (select Proper
 75854 |      11 | rsrootuser       | INSERT INTO "app1"."_etleap" (project, op, value) VALUES ('aFHvKIYp', 'L', 'fzPcxCcj')
 23925 |      11 | rsrootuser       | /*MTLN-1.37.4 (build 264)*/                                                                         +
       |         |                  | create table "analysis_ready"."airdna_propertymap" diststyle Even as sel
 75688 |       8 | rsrootuser       | INSERT INTO "app1"."_etleap" (project, op, value) VALUES ('TFcCkvCP', 'L', 'gjVcAqAl')