Files
file	partitioning.hpp
	Column partitioning APIs.

Enumerations
enum class	cudf::hash_id { cudf::HASH_IDENTITY = 0 , cudf::HASH_MURMUR3 }
	Identifies the hash function to be used in hash partitioning.

Functions
std::pair< std::unique_ptr< table >, std::vector< size_type > >	cudf::partition (table_view const &t, column_view const &partition_map, size_type num_partitions, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Partitions rows of `t` according to the mapping specified by `partition_map`.

std::pair< std::unique_ptr< table >, std::vector< size_type > >	cudf::hash_partition (table_view const &input, std::vector< size_type > const &columns_to_hash, int num_partitions, hash_id hash_function=hash_id::HASH_MURMUR3, uint32_t seed=DEFAULT_HASH_SEED, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Partitions rows from the input table into multiple output tables.

std::pair< std::unique_ptr< cudf::table >, std::vector< cudf::size_type > >	cudf::round_robin_partition (table_view const &input, cudf::size_type num_partitions, cudf::size_type start_partition=0, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Round-robin partition.

Detailed Description

Enumeration Type Documentation

◆ hash_id

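enum cudf::hash_id

strong

Identifies the hash function to be used in hash partitioning.

Enumerator
HASH_IDENTITY	Identity hash function that simply returns the key to be hashed.
HASH_MURMUR3	Murmur3 hash function.

Definition at line 40 of file partitioning.hpp.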

Function Documentation

◆ hash_partition()

std::pair<std::unique_ptr<table>, std::vector<size_type> > cudf::hash_partition	(	table_view const &	input,
		std::vector< size_type > const &	columns_to_hash,
		int	num_partitions,
		hash_id	hash_function = `hash_id::HASH_MURMUR3`,
		uint32_t	seed = `DEFAULT_HASH_SEED`,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

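Partitions rows from the input table into multiple output tables.

Partitions rows of `input` into `num_partitions` bins based on the hash value of the columns specified by `columns_to_hash`. Rows partitioned into the same bin are grouped consecutively in the output table. Returns a vector of row offsets to the start of each partition in the output table.

Exceptions

std::out_of_range	if an index in `columns_to_hash` is invalid

Parameters

input	The table to partition
columns_to_hash	Indices of the input columns to hash
num_partitions	The number of partitions to use
hash_function	Optional hash id that selects the hash function to use
seed	Optional seed value for the hash function
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource used to allocate the returned table's device memory

Returns: An output table and a vector of row offsets to each partition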
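
The binning described above can be illustrated with a small host-side model. This is only a sketch and not the libcudf implementation: it handles a single integer key column, uses the identity hash (mirroring `HASH_IDENTITY`) so the result is easy to follow, and assumes the bin is chosen as `hash % num_partitions`, which this page does not state. The name `hash_partition_model` is hypothetical. The model keeps rows of a bin in input order and returns one start offset per partition; neither detail is guaranteed by the text above.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Host-side model of hash partitioning for one integer key column.
// Assumption: bin = hash(key) % num_partitions, with the identity hash
// standing in for hash_id::HASH_IDENTITY. The real device code may differ.
std::pair<std::vector<std::int32_t>, std::vector<int>> hash_partition_model(
    std::vector<std::int32_t> const& keys, int num_partitions)
{
  assert(num_partitions > 0);
  auto const bin_of = [num_partitions](std::int32_t key) {
    // Identity hash: the key's bit pattern is its own hash value.
    auto const hash = static_cast<std::uint32_t>(key);
    return static_cast<int>(hash % static_cast<std::uint32_t>(num_partitions));
  };

  // Count the rows that fall into each bin.
  std::vector<int> counts(num_partitions, 0);
  for (auto key : keys) { ++counts[bin_of(key)]; }

  // Exclusive scan of the counts gives the start offset of each partition.
  std::vector<int> offsets(num_partitions, 0);
  for (int p = 1; p < num_partitions; ++p) { offsets[p] = offsets[p - 1] + counts[p - 1]; }

  // Scatter every row to the next free slot of its bin.
  std::vector<std::int32_t> out(keys.size());
  std::vector<int> cursor = offsets;
  for (auto key : keys) { out[cursor[bin_of(key)]++] = key; }

  return std::make_pair(std::move(out), std::move(offsets));
}
```

With keys `{10, 3, 7, 4, 9, 6}` and three partitions, the keys divisible by 3 land in bin 0 and the rest in bin 1, so bin 2 is empty and its start offset equals the row count.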

◆ partition()

std::pair<std::unique_ptr<table>, std::vector<size_type> > cudf::partition	(	table_view const &	t,
		column_view const &	partition_map,
		size_type	num_partitions,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

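Partitions rows of `t` according to the mapping specified by `partition_map`.

For each row at index `i` in `t`, `partition_map[i]` indicates which partition row `i` belongs to. `partition` creates a new table by rearranging the rows of `t` such that rows in the same partition are contiguous. The returned table is in ascending order of partition index, from `[0, num_partitions)`. The order within each partition is undefined.

Returns a `vector` of `num_partitions + 1` values that indicate the starting position of each partition within the returned table, i.e., partition `i` starts at `offsets[i]` (inclusive) and ends at `offsets[i+1]` (exclusive). As a result, if a value `j` in `[0, num_partitions)` does not appear in `partition_map`, partition `j` will be empty, i.e., `offsets[j+1] - offsets[j] == 0`.

Values in `partition_map` must be in the range `[0, num_partitions)`, otherwise behavior is undefined.

Exceptions

cudf::logic_error	when `partition_map` is a non-integer type
cudf::logic_error	when `partition_map.has_nulls() == true`
cudf::logic_error	when `partition_map.size() != t.num_rows()`

Parameters

t	The table to partition
partition_map	Non-nullable column of integer values that map each row in `t` to its partition.
num_partitions	The total number of partitions
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource used to allocate the returned table's device memory

Returns: A pair containing the reordered table and a vector of `num_partitions + 1` offsets to each partition, such that the size of partition `i` is determined by `offsets[i+1] - offsets[i]`.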
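
To make the offsets convention concrete, the following host-side sketch models the documented behavior for a single column of `int` values. It is not how libcudf implements `partition`; `partition_model` is a hypothetical name, and the model happens to keep rows of a partition in input order even though the real API leaves that order undefined.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Host-side model of the partition() contract for one column of ints.
// Returns the reordered rows and num_partitions + 1 offsets.
std::pair<std::vector<int>, std::vector<int>> partition_model(
    std::vector<int> const& rows, std::vector<int> const& partition_map, int num_partitions)
{
  if (partition_map.size() != rows.size()) {
    throw std::invalid_argument("partition_map.size() must equal the number of rows");
  }

  // offsets[p + 1] first accumulates the row count of partition p.
  std::vector<int> offsets(num_partitions + 1, 0);
  for (int p : partition_map) {
    assert(p >= 0 && p < num_partitions);  // out-of-range values are undefined in the real API
    ++offsets[p + 1];
  }
  // Running sum turns the counts into [start, end) boundaries.
  for (int p = 0; p < num_partitions; ++p) { offsets[p + 1] += offsets[p]; }

  // Scatter each row to the next free slot of its partition.
  std::vector<int> out(rows.size());
  std::vector<int> cursor(offsets.begin(), offsets.end() - 1);
  for (std::size_t i = 0; i < rows.size(); ++i) { out[cursor[partition_map[i]]++] = rows[i]; }

  return std::make_pair(std::move(out), std::move(offsets));
}
```

For rows `{10, 20, 30, 40, 50}` with map `{2, 0, 2, 0, 3}` and four partitions, partition 1 receives no rows, so `offsets[2] - offsets[1]` is zero.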

◆ round_robin_partition()

std::pair<std::unique_ptr<cudf::table>, std::vector<cudf::size_type> > cudf::round_robin_partition	(	table_view const &	input,
		cudf::size_type	num_partitions,
		cudf::size_type	start_partition = `0`,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

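Round-robin partition.

Returns a new table with rows re-arranged into partition groups, and a vector of row offsets to the start of each partition in the output table. Rows are assigned to partitions based on their row index in the table, in a round-robin fashion.

Exceptions

cudf::logic_error	if `num_partitions <= 1`
cudf::logic_error	if `start_partition >= num_partitions`

A good analogy for the algorithm is dealing out cards:

The deck of cards is represented as the rows in the table.
The number of partitions is the number of players being dealt cards.
`start_partition` indicates which player starts getting cards first.

The algorithm has two outcomes:

A new deck of cards, formed by stacking each player's cards back into a single deck, starting with player 0 and preserving the order in which the cards were dealt to each player.
A vector into the output deck indicating where each player's cards start.

A player's deck (partition) is the range of cards from the corresponding offset up to the next player's starting offset, or up to the last card in the deck for the last player.

When `num_partitions > nrows`, there are more players than cards. Dealing starts at the indicated first player and continues around the players until the cards run out, which happens before every player has received one. A player who receives no cards is represented by `offset[i] == offset[i+1]`, or, for the last player `i == num_partitions-1`, by `offset[i] == table.num_rows()`, meaning there are no cards (rows) in that player's deck (partition).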
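
The rule for reading partition sizes out of the returned offsets can be sketched as a small helper. It assumes, as in the examples of this section, that the vector holds one start offset per partition and that the last partition runs to the end of the table. `partition_sizes` is a hypothetical name, not part of libcudf.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sizes of all partitions, given one start offset per partition and the
// total number of rows in the output table.
std::vector<int> partition_sizes(std::vector<int> const& offsets, int num_rows)
{
  std::vector<int> sizes(offsets.size());
  for (std::size_t i = 0; i < offsets.size(); ++i) {
    // A partition ends where the next one starts; the last one ends at num_rows.
    int const end = (i + 1 < offsets.size()) ? offsets[i + 1] : num_rows;
    assert(offsets[i] <= end);  // start offsets must be non-decreasing
    sizes[i] = end - offsets[i];
  }
  return sizes;
}
```

A size of zero corresponds to a player who received no cards.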

Example 1:
input:
table => col 1 {0, ..., 12}
num_partitions = 3
start_partition = 0

output: pair<table, partition_offsets>
table => col 1 {0,3,6,9,12,1,4,7,10,2,5,8,11}
partition_offsets => {0,5,9}

Example 2:
input:
table => col 1 {0, ..., 12}
num_partitions = 3
start_partition = 1

output: pair<table, partition_offsets>
table => col 1 {2,5,8,11,0,3,6,9,12,1,4,7,10}
partition_offsets => {0,4,9}

Example 3:
input:
table => col 1 {0, ..., 10}
num_partitions = 3
start_partition = 0

output: pair<table, partition_offsets>
table => col 1 {0,3,6,9,1,4,7,10,2,5,8}
partition_offsets => {0,4,8}

Example 4:
input:
table => col 1 {0, ..., 10}
num_partitions = 3
start_partition = 1

output: pair<table, partition_offsets>
table => col 1 {2,5,8,0,3,6,9,1,4,7,10}
partition_offsets => {0,3,7}

Example 5:
input:
table => col 1 {0, ..., 10}
num_partitions = 3
start_partition = 2

output: pair<table, partition_offsets>
table => col 1 {1,4,7,10,2,5,8,0,3,6,9}
partition_offsets => {0,4,7}

Example 6:
input:
table => col 1 {0, ..., 10}
num_partitions = 15 > num_rows = 11
start_partition = 2

output: pair<table, partition_offsets>
table => col 1 {0,1,2,3,4,5,6,7,8,9,10}
partition_offsets => {0,0,0,1,2,3,4,5,6,7,8,9,10,11,11}

Example 7:
input:
table => col 1 {0, ..., 10}
num_partitions = 15 > num_rows = 11
start_partition = 10

output: pair<table, partition_offsets>
table => col 1 {5,6,7,8,9,10,0,1,2,3,4}
partition_offsets => {0,1,2,3,4,5,6,6,6,6,6,7,8,9,10}

Example 8:
input:
table => col 1 {0, ..., 10}
num_partitions = 15 > num_rows = 11
start_partition = 14

output: pair<table, partition_offsets>
table => col 1 {1,2,3,4,5,6,7,8,9,10,0}
partition_offsets => {0,1,2,3,4,5,6,7,8,9,10,10,10,10,10}

Example 9:
input:
table => col 1 {0, ..., 10}
num_partitions = 11 == num_rows = 11
start_partition = 2

output: pair<table, partition_offsets>
table => col 1 {9,10,0,1,2,3,4,5,6,7,8}
partition_offsets => {0,1,2,3,4,5,6,7,8,9,10}
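
The examples above can be reproduced with a short host-side model of the card-dealing analogy. This is a sketch for a single column of `int` values, not the libcudf implementation: it assumes row `i` is dealt to partition `(start_partition + i) % num_partitions`, a rule inferred from the examples rather than stated by the API. `round_robin_model` is a hypothetical name, and `std::invalid_argument` stands in for `cudf::logic_error`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Host-side model of round-robin partitioning for one column of ints.
// Returns the restacked rows and one start offset per partition.
std::pair<std::vector<int>, std::vector<int>> round_robin_model(
    std::vector<int> const& rows, int num_partitions, int start_partition = 0)
{
  if (num_partitions <= 1) { throw std::invalid_argument("num_partitions must be > 1"); }
  if (start_partition < 0 || start_partition >= num_partitions) {
    throw std::invalid_argument("start_partition must be in [0, num_partitions)");
  }

  // Deal the cards: row i goes to player (start_partition + i) % num_partitions.
  auto const n     = static_cast<std::size_t>(num_partitions);
  auto const first = static_cast<std::size_t>(start_partition);
  std::vector<std::vector<int>> hands(n);
  for (std::size_t i = 0; i < rows.size(); ++i) { hands[(first + i) % n].push_back(rows[i]); }

  // Restack the hands into one deck, starting with player 0, and record
  // where each player's cards begin.
  std::vector<int> out;
  out.reserve(rows.size());
  std::vector<int> offsets;
  offsets.reserve(n);
  for (auto const& hand : hands) {
    offsets.push_back(static_cast<int>(out.size()));
    out.insert(out.end(), hand.begin(), hand.end());
  }
  assert(out.size() == rows.size());  // every card ends up in exactly one hand

  return std::make_pair(std::move(out), std::move(offsets));
}
```

Calling the model with rows `0` through `12`, `num_partitions = 3` and `start_partition = 1` yields the table and offsets of Example 2; rows `0` through `10` with `num_partitions = 15` and `start_partition = 10` yield those of Example 7.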

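Parameters

[in]	input	The input table to be round-robin partitioned
[in]	num_partitions	Number of partitions for the table
[in]	start_partition	Index of the partition that receives the first row
[in]	stream	CUDA stream used for device memory operations and kernel launches
[in]	mr	Device memory resource used to allocate the returned table's device memory

Returns: A std::pair consisting of a unique_ptr to the partitioned table and the offsets of each partition within that table.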