重新排序分区#

group 分区

枚举

enum class hash_id#

标识用于哈希分区的哈希函数。

取值

enumerator HASH_IDENTITY#: 恒等哈希函数，简单地返回要进行哈希的键。

enumerator HASH_MURMUR3#: Murmur3 哈希函数。

函数

std::pair<std::unique_ptr<table>, std::vector<size_type>> partition(table_view const &t, column_view const &partition_map, size_type num_partitions, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

根据 partition_map 指定的映射对 t 的行进行分区。

对于 t 中索引为 i 的每一行，partition_map[i] 表示行 i 所属的分区。partition 通过重新排列 t 的行来创建新表，使得同一分区的行是连续的。返回的表按照 [0, num_partitions) 的升序分区顺序排列。每个分区内的顺序未定义。

返回一个包含 num_partitions + 1 个值的 vector<size_type>，表示返回表中每个分区的起始位置，即分区 i 从 offsets[i] 开始（包含），到 offset[i+1] 结束（不包含）。因此，如果 [0, num_partitions) 中的值 j 未出现在 partition_map 中，则分区 j 将为空，即 offsets[j+1] - offsets[j] == 0。

partition_map 中的值必须在范围 [0, num_partitions) 内，否则行为未定义。

抛出:

cudf::logic_error – 当 partition_map 为非整数类型时
cudf::logic_error – 当 partition_map.has_nulls() == true 时
cudf::logic_error – 当 partition_map.size() != t.num_rows() 时

参数:

t – 要分区的表
partition_map – 非空整型列，用于将 t 中的每一行映射到其所属分区。
num_partitions – 分区的总数
stream – 用于设备内存操作和内核启动的 CUDA 流
mr – 用于分配返回表的设备内存的设备内存资源

返回:

包含重新排序后的表和包含 num_partitions + 1 个值的分区偏移量向量的 Pair，其中分区 i 的大小由 offset[i+1] - offset[i] 确定。

std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(table_view const &input, std::vector<size_type> const &columns_to_hash, int num_partitions, hash_id hash_function = hash_id::HASH_MURMUR3, uint32_t seed = DEFAULT_HASH_SEED, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

将输入表的行分区到多个输出表中。

根据 columns_to_hash 指定的列的哈希值，将 input 的行分区到 num_partitions 个桶中。分区到同一桶的行在输出表中连续分组。返回一个向量，包含输出表中每个分区起始行的偏移量。

抛出:

std::out_of_range – 如果 columns_to_hash 中的索引无效

参数:

input – 要分区的表
columns_to_hash – 要进行哈希处理的输入列的索引
num_partitions – 要使用的分区数量
hash_function – 可选的哈希 ID，选择要使用的哈希函数
seed – 哈希函数的可选种子值
stream – 用于设备内存操作和内核启动的 CUDA 流
mr – 用于分配返回表的设备内存的设备内存资源

返回:

一个输出表以及每个分区行偏移量的向量

std::pair<std::unique_ptr<cudf::table>, std::vector<cudf::size_type>> round_robin_partition(table_view const &input, cudf::size_type num_partitions, cudf::size_type start_partition = 0, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

轮询分区。

返回一个新表，其中行被重新排列到分区组中，以及一个向量，包含输出表中每个分区起始行的偏移量。行根据其在表中的行索引以轮询方式分配到分区。

这个算法的一个很好的类比是发牌

牌堆代表表中的行。
分区的数量是被发牌的玩家数量。
start_partition 指示哪个玩家首先开始获得牌。

该算法有两个结果

通过将每个玩家的牌再次叠回一个牌堆形成的另一副牌，保留发给每个玩家的牌的顺序，从玩家 0 开始。
一个指向输出牌堆的向量，指示一个玩家的牌从哪里开始。

一个玩家的牌堆（分区）是开始于相应偏移量并结束于下一个玩家的起始偏移量（如果是最后一个玩家则结束于牌堆的最后一张牌）的牌的范围。

当 num_partitions > nrows 时，玩家比牌多。我们开始发牌给指定的第一个玩家，并继续轮流发牌，直到牌发完为止，而此时玩家可能还没轮完。没有拿到任何牌的玩家由 offset[i] == offset[i+1] 表示，如果 i == num_partitions-1 则是 offset[i] == table.num_rows()，这意味着他们的牌堆（分区）中没有牌（行）。


Example 1:
input:
table => col 1 {0, ..., 12}
num_partitions = 3
start_partition = 0

output: pair<table, partition_offsets>
table => col 1 {0,3,6,9,12,1,4,7,10,2,5,8,11}
partition_offsets => {0,5,9}

Example 2:
input:
table => col 1 {0, ..., 12}
num_partitions = 3
start_partition = 1

output: pair<table, partition_offsets>
table => col 1 {2,5,8,11,0,3,6,9,12,1,4,7,10}
partition_offsets => {0,4,9}

Example 3:
input:
table => col 1 {0, ..., 10}
num_partitions = 3
start_partition = 0

output: pair<table, partition_offsets>
table => col 1 {0,3,6,9,1,4,7,10,2,5,8}
partition_offsets => {0,4,8}

Example 4:
input:
table => col 1 {0, ..., 10}
num_partitions = 3
start_partition = 1

output: pair<table, partition_offsets>
table => col 1 {2,5,8,0,3,6,9,1,4,7,10}
partition_offsets => {0,3,7}

Example 5:
input:
table => col 1 {0, ..., 10}
num_partitions = 3
start_partition = 2

output: pair<table, partition_offsets>
table => col 1 {1,4,7,10,2,5,8,0,3,6,9}
partition_offsets => {0,4,7}

Example 6:
input:
table => col 1 {0, ..., 10}
num_partitions = 15 > num_rows = 11
start_partition = 2

output: pair<table, partition_offsets>
table => col 1 {0,1,2,3,4,5,6,7,8,9,10}
partition_offsets => {0,0,0,1,2,3,4,5,6,7,8,9,10,11,11}

Example 7:
input:
table => col 1 {0, ..., 10}
num_partitions = 15 > num_rows = 11
start_partition = 10

output: pair<table, partition_offsets>
table => col 1 {5,6,7,8,9,10,0,1,2,3,4}
partition_offsets => {0,1,2,3,4,5,6,6,6,6,6,7,8,9,10}

Example 8:
input:
table => col 1 {0, ..., 10}
num_partitions = 15 > num_rows = 11
start_partition = 14

output: pair<table, partition_offsets>
table => col 1 {1,2,3,4,5,6,7,8,9,10,0}
partition_offsets => {0,1,2,3,4,5,6,7,8,9,10,10,10,10,10}

Example 9:
input:
table => col 1 {0, ..., 10}
num_partitions = 11 == num_rows = 11
start_partition = 2

output: pair<table, partition_offsets>
table => col 1 {9,10,0,1,2,3,4,5,6,7,8}
partition_offsets => {0,1,2,3,4,5,6,7,8,9,10}



抛出:

cudf::logic_error – 如果 num_partitions <= 1
cudf::logic_error – 如果 start_partition >= num_partitions


参数:

input – [in] 要进行轮询分区的输入表
num_partitions – [in] 表的分区数量
start_partition – [in] 第一个分区的索引
stream – [in] 用于设备内存操作和内核启动的 CUDA 流
mr – [in] 用于分配返回表的设备内存的设备内存资源


返回:
一个 std::pair，包含指向分区表的 unique_ptr 以及表中每个分区的偏移量。

重新排序分区#

当前页面