郑少群博客: Bitmap 索引 vs. B

Bitmap 索引 vs. B

——理解每个索引的适当使用对性能的影响

本文内容

比较索引
演示环境
步骤 1A
步骤 1B
步骤 2A
步骤 2B

传统的观点认为，Bitmap 索引更适合具有较低不重复（distinct）值的列——如性别、状态和关系。然而，这个假设并不完全。实际中，Bitmap 索引对系统总是有个建议，数据不会被很多并发系统频繁更新。事实上，正如我想说的，一个具有 100% 唯一值（unique values）列（该列可以作为主键）上的 Bitmap 索引具有与 B-tree 索引同样的效率。

本文，我将提供一些例子及其执行计划，这些例子有两种索引类型，低基数列和高基数列。这些例子将帮助 DBA 理解 Bitmap 索引的使用并不是基数依赖，而是应用依赖。

另外，本文是基于一篇 2005 年的英文文章，表格数据是英文原来的，其他的是，自己在本机测试的结果。

有意思的是，英文的表格数据，物理 IO 有不为 0 的值，可是我在测试中，尽管也是一百万的数据，却一直是 0。我的理解是，尽管数据多，但是总体量不大，而且硬件的发展关系很大。

比较索引

在唯一值列上使用 Bitmap 索引有很多不利之处——其中一个就是需要足够的空间（Oracle 也不推荐这样做）。然而，Bitmap 索引的大小依赖索引列的基数和数据分布。因此，GENDER 列的 Bitmap 索引要比其 B-tree 索引小。相反，EMPNO 列（可作为主键列）的 Bitmap 索引要比其 B-tree 索引大很多。但因为，访问决策支持系统（decision-support systems，DSS）的用户比访问事务处理系统（transaction-processing，OLTP）要少很多（毕竟只有高层领导才会用）。对这些应用程序来说，资源不是问题。

为了说明这个观点，我创建两个表，TEST_NORMAL 和 TEST_RANDOM，插入 1000000 条数据。插入 TEST_NORMAL 表的数据是按顺序的，而 TEST_RANDOM 表的数据是根据 TEST_NORMAL 表数据，随机追加的。

Create table test_normal (empno number(10), ename varchar2(30), sal number(10));

Begin

  For i in 1 .. 1000000 Loop

    Insert into test_normal

    values

      (i, dbms_random.string('U', 30), dbms_random.value(1000, 7000));

    If mod(i, 10000) = 0 then

      Commit;

    End if;

  End loop;

End;

Create table test_random

as

select /*+ append */

  from test_normal

 order by dbms_random.random;

SQL> select count(*) "Total Rows" from test_normal;

Total Rows

----------

SQL> select count(distinct empno) "Distinct Values" from test_normal;

Distinct Values

---------------

SQL> select count(*) "Total Rows" from test_random;

Total Rows

----------

SQL> select count(distinct empno) "Distinct Values" from test_random;

Distinct Values

---------------

SQL>

注意：TEST_NORMAL 表的数据是已有组织，而 TEST_RANDOM 表的数据是根据 TEST_NORMAL 表随机追加的，因此是非组织的。如下所示，选择最前的两行数据。两个表的 EMPNO 列值 100% 不重复，可以作为表主键。如果定义该列为主键，那么你只能为该列创建 B-tree 索引，不能是 Bitmap，因为，Oracle 不支持 Bitmap 主键索引。
SQL> select * from test_normal where rownum<=2;
 
      EMPNO ENAME                                  SAL
----------- ------------------------------ -----------
          1 MNSUXLUHXALBKPRABKEOQLVMMVGBQF        1669
          2 QKCBLPNSRVSEVDIDTMJOIFQTYJZVOQ        6476
 
SQL> select * from test_random where rownum<=2;
 
      EMPNO ENAME                                  SAL
----------- ------------------------------ -----------
     380959 MDWFHKQEVYAUYBLGYPIFBDJDKPPCEZ        4287
     222486 ELOBFRHYVLXZRUVOSMTGBERRVARJSL        3601
 
SQL> 

为了分析这些索引的行为，我们将执行如下步骤：

1，在 TEST_NORMAL 表上

    A - 在 EMPNO 列上创建 bitmap 索引，执行一些带等值谓词的查询。

    B - 在 EMPNO 列上创建 B-tree 索引，执行一些带等值谓词的查询，并比较查询的逻辑 IO 和物理 IO。

2，在 TEST_RANDOM 表上

    A - 同步骤 1A。

    B - 同步骤 1B。

3，在 TEST_NORMAL 表上

    A - 同步骤 1A，但执行带范围的查询

    B - 同步骤 1B，但执行带范围的查询。然后比较统计信息。

4，在 TEST_RANDOM 表上

    A - 同步骤 3A。

    B - 同步骤 3B。

5，在 TEST_NORMAL 表上

    A - 在 SAL 列创建 bitmap 索引，然后执行一些等值谓词和范围谓词的查询。

    B - 在 SAL 列创建 B-tree 索引，然后执行一些等值谓词和范围谓词的查询（同 5A 一样）。比较这些查询 IO。

6，向两个表添加 GENDER 列，并用三个可能的值更新：M、F 和 null。更新是基于某个条件。

7，在 GENDER 列创建 bitmap 索引，然后执行一些等值谓词。

8，在 GENDER 列创建 B-tree 索引，然后执行一些等值谓词。比较步骤 7 的结果。

步骤 1 到 4 会涉及到高基数（100% 不重复）列，而步骤 5 是一个正常基数，步骤 7 和 8 是低基数列。

演示环境

Windows 7 64 位旗舰版操作系统
Oracle 11g Release 2 (11.2)
Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz - 双物理内核四线程
4G 内存

步骤 1A（在 TEST_NORMAL）

在该步骤，我们将在 TEST_NORMAL 表上创建 bitmap 索引，然后检查一下索引的大小及其的簇因素，表的大小。最后执行一些等值谓词，并注意利用 bitmap 索引时这些查询的 IO。

SQL> create bitmap index normal_empno_bmx on test_normal(empno);

Index created

SQL> analyze table test_normal compute statistics for table for all indexes for all indexed columns;

Table analyzed

SQL> select substr(segment_name,1,30) segment_name, bytes/1024/1024 "Size in MB"

  2  from user_segments

  3  where segment_name in ('TEST_NORMAL','NORMAL_EMPNO_BMX');

SEGMENT_NAME                                                 Size in MB

------------------------------------------------------------ ----------

NORMAL_EMPNO_BMX                                                     28

TEST_NORMAL                                                          50

SQL> select index_name, clustering_factor from user_indexes;

INDEX_NAME                     CLUSTERING_FACTOR

------------------------------ -----------------

NORMAL_EMPNO_BMX                         1000000

SQL>

你可以看到，索引的大小为 28 MB，簇因素等于表的行数。现在，执行一些等值谓词：

SQL> set autot traceonly

SQL> select * from test_normal where empno=&empno;

输入 empno 的值:  1000

原值    1: select * from test_normal where empno=&empno

新值    1: select * from test_normal where empno=1000

执行计划

----------------------------------------------------------

Plan hash value: 4267925254

-------------------------------------------------------------------------------------------------

| Id  | Operation                    | Name             | Rows  | Bytes | Cost (%CPU)| Time     |

-------------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT             |                  |     1 |    34 |     3  (0)| 00:00:01 |

|   1 |  TABLE ACCESS BY INDEX ROWID | TEST_NORMAL      |     1 |    34 |     3  (0)| 00:00:01 |

|   2 |   BITMAP CONVERSION TO ROWIDS|                  |       |       |           |          |

|*  3 |    BITMAP INDEX SINGLE VALUE | NORMAL_EMPNO_BMX |       |       |           |          |

-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

   3 - access("EMPNO"=1000)

统计信息

----------------------------------------------------------

          0  recursive calls

          0  db block gets

          5  consistent gets

          0  physical reads

          0  redo size

        695  bytes sent via SQL*Net to client

        519  bytes received via SQL*Net from client

          2  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

          1  rows processed

SQL>

步骤 1B（在 TEST_NORMAL）

现在，删除 bitmap 索引，在 EMPNO 列上创建 B-tree 索引。跟步骤 1A 一样，检查索引大小及其簇因素，并执行同样的操作，比较它们的 IO。

SQL> drop index normal_empno_bmx;

索引已删除。

SQL> create index normal_empno_idx on test_normal(empno);

索引已创建。

SQL> analyze table test_normal compute statistics for table for all indexes for

all indexed columns;

表已分析。

SQL> select substr(segment_name,1,30) segment_name, bytes/1024/1024 "Size in MB"

  2  from user_segments

  3  where segment_name in ('TEST_NORMAL','NORMAL_EMPNO_IDX');

SEGMENT_NAME                                                 Size in MB

------------------------------------------------------------ ----------

NORMAL_EMPNO_IDX                                                     18

TEST_NORMAL                                                          50

SQL> select index_name, clustering_factor from user_indexes;

INDEX_NAME                     CLUSTERING_FACTOR

------------------------------ -----------------

NORMAL_EMPNO_IDX                            6210

SQL>

显然，在改表上，EMPNO 列的 B-tree 索引比 bitmap 索引小。B-tree 索引的簇因素更接近表中数据块的数量。这样，B-tree 索引对范围谓词查询更有效。现在，使用 B-tree 索引执行同样的查询。

SQL> set autot traceonly

SQL> select * from test_normal where empno=&empno;

输入 empno 的值:  1000

原值    1: select * from test_normal where empno=&empno

新值    1: select * from test_normal where empno=1000

执行计划

----------------------------------------------------------

Plan hash value: 1781697849

------------------------------------------------------------------------------------------------

| Id  | Operation                   | Name             | Rows  | Bytes | Cost (%CPU)| Time     |

------------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT            |                  |     1 |    34 |     4 (0)| 00:00:01 |

|   1 |  TABLE ACCESS BY INDEX ROWID| TEST_NORMAL      |     1 |    34 |     4 (0)| 00:00:01 |

|*  2 |   INDEX RANGE SCAN          | NORMAL_EMPNO_IDX |     1 |       |     3 (0)| 00:00:01 |

------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

   2 - access("EMPNO"=1000)

统计信息

----------------------------------------------------------

          1  recursive calls

          0  db block gets

          5  consistent gets

          0  physical reads

          0  redo size

        695  bytes sent via SQL*Net to client

        519  bytes received via SQL*Net from client

          2  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

          1  rows processed

SQL>

正如你看到的，当执行不同 EMPNO 值查询时，consistent gets 和 physical reads 的数量对 Bitmap 索引和 B-tree 索引在 100% 不重复列上是一致的。

Bitmap			B-tree
consistent gets	physical reads	EMPNO	consistent gets	physical reads
5	0	1000	5	0
5	2	2398	5	2
5	2	8545	5	2
5	2	98008	5	2
5	2	85342	5	2
5	2	128444	5	2
5	2	858	5	2

步骤 2A（在 TEST_RANDOM）

现在，执行在 TEST_RANDOM 表上进行同样的实验：

SQL> create bitmap index random_empno_bmx on test_random(empno);

索引已创建。

SQL> analyze table test_random compute statistics for table for all indexes for

all indexed columns;

表已分析。

SQL> select substr(segment_name,1,30) segment_name, bytes/1024/1024 "Size in MB"

  2  from user_segments

  3  where segment_name in ('TEST_RANDOM','RANDOM_EMPNO_BMX');

SEGMENT_NAME                                                 Size in MB

------------------------------------------------------------ ----------

RANDOM_EMPNO_BMX                                                     28

TEST_RANDOM                                                          50

SQL> select index_name, clustering_factor from user_indexes;

INDEX_NAME                     CLUSTERING_FACTOR

------------------------------ -----------------

RANDOM_EMPNO_BMX                         1000000

NORMAL_EMPNO_IDX                            6210

SQL>

统计信息（索引大小及其簇因素）同 TEST_NORMAL 表上的那些索引一致。

SQL> set autot traceonly

SQL> select * from test_random where empno=&empno;

输入 empno 的值:  1000

原值    1: select * from test_random where empno=&empno

新值    1: select * from test_random where empno=1000

执行计划

----------------------------------------------------------

Plan hash value: 3133668422

-------------------------------------------------------------------------------------------------

| Id  | Operation                    | Name             | Rows  | Bytes | Cost (%CPU)| Time     |

-------------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT             |                  |     1 |    39 |     3  (0)| 00:00:01 |

|   1 |  TABLE ACCESS BY INDEX ROWID | TEST_RANDOM      |     1 |    39 |     3  (0)| 00:00:01 |

|   2 |   BITMAP CONVERSION TO ROWIDS|                  |       |       |           |          |

|*  3 |    BITMAP INDEX SINGLE VALUE | RANDOM_EMPNO_BMX |       |       |           |          |

-------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

   3 - access("EMPNO"=1000)

统计信息

----------------------------------------------------------

          0  recursive calls

          0  db block gets

          5  consistent gets

          0  physical reads

          0  redo size

        695  bytes sent via SQL*Net to client

        519  bytes received via SQL*Net from client

          2  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

          1  rows processed

SQL>

步骤 2B（在 TEST_RANDOM）

与步骤 1B 相比，我们现在删除 bitmap 索引，在 EMPNO 列上创建 B-tree 索引。

SQL> drop index RANDOM_EMPNO_BMX;

索引已删除。

SQL> create index random_empno_idx on test_random(empno);

索引已创建。

SQL> analyze table test_random compute statistics for table for all indexes for

all indexed columns;

表已分析。

SQL> select substr(segment_name,1,30) segment_name, bytes/1024/1024 "Size in MB"

  2  from user_segments

  3  where segment_name in ('TEST_RANDOM','RANDOM_EMPNO_IDX');

SEGMENT_NAME                                                 Size in MB

------------------------------------------------------------ ----------

RANDOM_EMPNO_IDX                                                     18

TEST_RANDOM                                                          50

SQL> select index_name, clustering_factor from user_indexes;

INDEX_NAME                     CLUSTERING_FACTOR

------------------------------ -----------------

RANDOM_EMPNO_IDX                          999823

NORMAL_EMPNO_IDX                            6210

SQL>

索引大小等于 TEST_NORMAL 表上的索引大小，但是簇因素很大，接近表行数，这使得该索引对范围谓词查询效率不高（正如步骤 4 看到的）。簇因素不会影响等值谓词查询，因为数据行具有 100% 不重复值，每个键的行数为 1。

现在执行等值谓词查询。

SQL> set autot traceonly

SQL> select * from test_random where empno=&empno;

输入 empno 的值:  1000

原值    1: select * from test_random where empno=&empno

新值    1: select * from test_random where empno=1000

执行计划

----------------------------------------------------------

Plan hash value: 3101594612

------------------------------------------------------------------------------------------------

| Id  | Operation                   | Name             | Rows  | Bytes | Cost (%CPU)| Time     |

------------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT            |                  |     1 |    39 |     4 (0)| 00:00:01 |

|   1 |  TABLE ACCESS BY INDEX ROWID| TEST_RANDOM      |     1 |    39 |     4 (0)| 00:00:01 |

|*  2 |   INDEX RANGE SCAN          | RANDOM_EMPNO_IDX |     1 |       |     3 (0)| 00:00:01 |

------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

   2 - access("EMPNO"=1000)

统计信息

----------------------------------------------------------

          1  recursive calls

          0  db block gets

          5  consistent gets

          0  physical reads

          0  redo size

        695  bytes sent via SQL*Net to client

        519  bytes received via SQL*Net from client

          2  SQL*Net roundtrips to/from client

          0  sorts (memory)

          0  sorts (disk)

          1  rows processed

SQL>

另外，结果与步骤 1A 和 1B 一致。数据的分布对唯一列不会影响 consistent gets 和 physical reads。

TAG:

2012年11月11日星期日

Bitmap 索引 vs. B

Bitmap 索引 vs. B

本文内容

比较索引

演示环境

步骤 1A（在 TEST_NORMAL）

步骤 1B（在 TEST_NORMAL）

步骤 2A（在 TEST_RANDOM）

步骤 2B（在 TEST_RANDOM）