RDD<T>
A Resilient Distributed Dataset (RDD), the basic abstraction in DCF. Represents an immutable, partitioned collection of elements that can be operated on in parallel.
persist(storageType?: StorageType): CachedRDD<T>
Persist this RDD into storage after the first time it is computed. You should manually call unpersist()
on it later. If no storage level is specified, defaults to 'memory'.
cache(): CachedRDD<T>
Persist this RDD with the default storage level ('memory').
getNumPartitions: Promise<number>
Returns the number of partitions in this RDD, skipping computation when possible.
collect(): Promise<T[]>
Return a list that contains all of the elements in this RDD.
take(count: number): Promise<T[]>
Take the first count elements of the RDD.
count(): Promise<number>
Return the number of elements in this RDD.
max(): Promise<T | null>
Find the maximum item in this RDD. If the RDD is empty, returns null.
min(): Promise<T | null>
Find the minimum item in this RDD. If the RDD is empty, returns null.
mapPartitions<T1>(func: (v: T[]) => T1[] | Promise<T1[]>, env?: FunctionEnv): RDD<T1>
Return a new RDD by applying a function to each partition of this RDD.
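The per-partition semantics can be sketched with plain arrays. The helper below is hypothetical (not part of the DCF API); it models partitions as `T[][]` and shows that func receives each partition's elements as one array and its return value becomes the corresponding output partition.

```typescript
// Plain-array sketch of mapPartitions semantics (hypothetical helper, not DCF API):
// func is called once per partition with all of that partition's elements.
function mapPartitionsLocal<T, T1>(
  partitions: T[][],
  func: (v: T[]) => T1[],
): T1[][] {
  return partitions.map(part => func(part));
}

// Example: emit a single per-partition sum, one element per partition.
const sums = mapPartitionsLocal([[1, 2], [3, 4, 5]], part => [
  part.reduce((a, b) => a + b, 0),
]);
// sums is [[3], [12]]
```

This is why mapPartitions is useful for per-partition setup work (e.g. opening a connection once per partition) that map, which is called per element, cannot express.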
glom(): RDD<T[]>
Returns an RDD created by coalescing all elements within each partition into a list.
map<T1>(func: ((v: T) => T1), env?: FunctionEnv): RDD<T1>
Return a new RDD by applying a function to each element of this RDD.
reduce(func: ((a: T, b: T) => T), env?: FunctionEnv): Promise<T | null>
Reduces the elements of this RDD using the specified commutative and associative binary operator, returns the final result.
If the input RDD is empty, returns null.
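A local sketch of the reduce contract, using a hypothetical helper over a plain array: the operator is applied pairwise, and an empty input yields null rather than throwing.

```typescript
// Local sketch of reduce semantics (hypothetical helper, not DCF API):
// applies a commutative, associative operator; null for empty input.
function reduceLocal<T>(elements: T[], func: (a: T, b: T) => T): T | null {
  if (elements.length === 0) return null;
  return elements.reduce(func);
}

const total = reduceLocal([1, 2, 3, 4], (a, b) => a + b); // 10
const empty = reduceLocal<number>([], (a, b) => a + b);   // null
```

The operator must be commutative and associative because partitions are reduced independently and the partial results are then combined in no guaranteed order.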
flatMap<T1>(func: ((v: T) => T1[]), env?: FunctionEnv): RDD<T1>
Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
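The map-then-flatten behavior can be shown with a plain-array helper (hypothetical, not DCF API):

```typescript
// Local sketch of flatMap semantics: apply func to each element,
// then flatten the resulting arrays one level.
function flatMapLocal<T, T1>(elements: T[], func: (v: T) => T1[]): T1[] {
  return elements.flatMap(func);
}

const words = flatMapLocal(["hello world", "foo"], line => line.split(" "));
// ["hello", "world", "foo"]
```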
filter(func: (v: T) => boolean, env?: FunctionEnv): RDD<T>
Return a new RDD containing only the elements that satisfy a predicate.
distinct(numPartitions?: number): RDD<T>
Return a new RDD containing the distinct elements in this RDD.
repartition(numPartitions: number): RDD<T>
Return a new RDD that has exactly numPartitions partitions.
Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a full shuffle.
partitionBy(numPartitions: number, partitionFunc: (v: T) => number, env?: FunctionEnv): GeneratedRDD<T>
Return a copy of the RDD partitioned using the specified partitioner.
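A local model of how elements might be routed to partitions. This is a sketch, not the DCF implementation; in particular, mapping partitionFunc's output to an index via modulo is an assumption.

```typescript
// Hypothetical local model of partitionBy: partitionFunc maps each element
// to a number, reduced modulo numPartitions to pick the target partition
// (the modulo step is an assumption about DCF's behavior).
function partitionByLocal<T>(
  elements: T[],
  numPartitions: number,
  partitionFunc: (v: T) => number,
): T[][] {
  const parts: T[][] = Array.from({ length: numPartitions }, () => []);
  for (const el of elements) {
    const idx =
      ((partitionFunc(el) % numPartitions) + numPartitions) % numPartitions;
    parts[idx].push(el);
  }
  return parts;
}

const parts = partitionByLocal([0, 1, 2, 3, 4, 5], 2, v => v);
// even numbers land in partition 0, odd numbers in partition 1
```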
coalesce(numPartitions: number): GeneratedRDD<T>
Return a new RDD that is reduced to numPartitions partitions, while preserving element order.
reduceByKey<K, V>(this: RDD<[K, V]>, func: ((a: V, b: V) => V), numPartitions?: number, partitionFunc?: (v: K) => number, env?: FunctionEnv): RDD<[K, V]>
Merge the values for each key using an associative and commutative reduce function.
This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a “combiner” in MapReduce.
Output will be partitioned with numPartitions partitions, or at the default parallelism level if numPartitions is not specified. The default partitioner is hash-partition.
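The merge semantics can be sketched locally with a Map in place of a distributed RDD (a hypothetical helper, not DCF API):

```typescript
// Local sketch of reduceByKey semantics: merge all values sharing
// a key with func. Map preserves first-insertion order of keys.
function reduceByKeyLocal<K, V>(
  pairs: [K, V][],
  func: (a: V, b: V) => V,
): [K, V][] {
  const merged = new Map<K, V>();
  for (const [k, v] of pairs) {
    merged.set(k, merged.has(k) ? func(merged.get(k)!, v) : v);
  }
  return [...merged.entries()];
}

// Classic word-count style aggregation:
const counts = reduceByKeyLocal(
  [["a", 1], ["b", 1], ["a", 1]],
  (a, b) => a + b,
);
// counts is [["a", 2], ["b", 1]]
```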
combineByKey<K, V, C>(this: RDD<[K, V]>, createCombiner: ((a: V) => C), mergeValue: ((a: C, b: V) => C), mergeCombiners: ((a: C, b: C) => C), numPartitions?: number, partitionFunc?: (v: K) => number, env?: FunctionEnv): RDD<[K, C]>
Generic function to combine the elements for each key using a custom set of aggregation functions.
Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a “combined type” C.
Users provide three functions:
createCombiner, which turns a V into a C (e.g., creates a one-element list)
mergeValue, to merge a V into a C (e.g., adds it to the end of a list)
mergeCombiners, to combine two C's into a single one (e.g., merges the lists)
To avoid memory allocation, both mergeValue and mergeCombiners are allowed to modify and return their first argument instead of creating a new C.
In addition, users can control the partitioning of the output RDD.
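The three-function protocol can be modeled locally: each partition builds per-key combiners with createCombiner/mergeValue, and the per-partition results are then merged with mergeCombiners. This helper is a hypothetical sketch of the semantics, not DCF's implementation.

```typescript
// Local sketch of combineByKey semantics over explicit partitions.
function combineByKeyLocal<K, V, C>(
  partitions: [K, V][][],
  createCombiner: (a: V) => C,
  mergeValue: (a: C, b: V) => C,
  mergeCombiners: (a: C, b: C) => C,
): Map<K, C> {
  const result = new Map<K, C>();
  for (const partition of partitions) {
    // Map-side ("combiner") phase: build combiners within one partition.
    const local = new Map<K, C>();
    for (const [k, v] of partition) {
      local.set(
        k,
        local.has(k) ? mergeValue(local.get(k)!, v) : createCombiner(v),
      );
    }
    // Reduce-side phase: merge this partition's combiners into the result.
    for (const [k, c] of local) {
      result.set(k, result.has(k) ? mergeCombiners(result.get(k)!, c) : c);
    }
  }
  return result;
}

// Example: collect all values of each key into a list.
const grouped = combineByKeyLocal<string, number, number[]>(
  [[["a", 1], ["b", 2]], [["a", 3]]],
  v => [v],                     // createCombiner: one-element list
  (c, v) => (c.push(v), c),     // mergeValue: mutates its first argument
  (c1, c2) => c1.concat(c2),    // mergeCombiners: merges the lists
);
// grouped.get("a") is [1, 3]
```

Note that mergeValue mutates and returns its first argument, which is explicitly allowed to avoid allocation.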
union(...others: RDD<T>[]): RDD<T>
Return the union of this RDD and others.
saveAsTextFile(baseUrl: string, options?: Options): Promise<void>
Save this RDD as a text file, using string representations of elements.
Options
overwrite?: boolean – whether DCF should erase any existing files
encoding?: string – encoding of the text file(s) (default: 'utf8')
extension?: string – file extension for each partition (default: 'txt')
compressor?: (data: Buffer) => Buffer – an optional compressor function
functionEnv?: FunctionEnv – serialization context for functions (upvalues)
groupWith<K, V, ...V1>(this: RDD<[K, V]>, ...others: RDD<[K, ...V1]>): RDD<[K, [V[], ...V1[]]]>
Alias for cogroup but with support for multiple RDDs.
cogroup<K, V, V1>(this: RDD<[K, V]>, other: RDD<[K, V1]>, numPartitions?: number): RDD<[K, [V[], V1[]]]>
For each key k in this or other, return a resulting RDD that contains a tuple with the list of values for that key in this as well as in other.
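The shape of the result can be shown with a local helper (hypothetical, not DCF API): every key seen on either side appears once, paired with the lists of values from both sides.

```typescript
// Local sketch of cogroup semantics: group both sides' values per key.
function cogroupLocal<K, V, V1>(
  left: [K, V][],
  right: [K, V1][],
): [K, [V[], V1[]]][] {
  const keys = new Map<K, [V[], V1[]]>();
  for (const [k, v] of left) {
    if (!keys.has(k)) keys.set(k, [[], []]);
    keys.get(k)![0].push(v);
  }
  for (const [k, v] of right) {
    if (!keys.has(k)) keys.set(k, [[], []]);
    keys.get(k)![1].push(v);
  }
  return [...keys.entries()];
}

const cogrouped = cogroupLocal([["a", 1], ["a", 2]], [["a", "x"], ["b", "y"]]);
// [["a", [[1, 2], ["x"]]], ["b", [[], ["y"]]]]
```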
join<K, V, V1>(this: RDD<[K, V]>, other: RDD<[K, V1]>, numPartitions?: number): RDD<[K, [V, V1]]>
Return an RDD containing all pairs of elements with matching keys in this and other. Each pair of elements will be returned as a [k, [v1, v2]] tuple, where [k, v1] is in this and [k, v2] is in other.
Performs a hash join across the cluster.
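The inner-join semantics reduce to a nested match over keys; the helper below is a hypothetical local sketch (a naive nested loop, not the distributed hash join DCF performs).

```typescript
// Local sketch of join (inner) semantics: one [k, [v, w]] pair
// per combination of matching keys from the two sides.
function joinLocal<K, V, V1>(
  left: [K, V][],
  right: [K, V1][],
): [K, [V, V1]][] {
  const out: [K, [V, V1]][] = [];
  for (const [k, v] of left) {
    for (const [k1, w] of right) {
      if (k === k1) out.push([k, [v, w]]);
    }
  }
  return out;
}

const joined = joinLocal([["a", 1], ["b", 2]], [["a", "x"], ["a", "y"]]);
// [["a", [1, "x"]], ["a", [1, "y"]]] — "b" has no match and is dropped
```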
leftOuterJoin<K, V, V1>(this: RDD<[K, V]>, other: RDD<[K, V1]>, numPartitions?: number): RDD<[K, [V, V1]]>
Perform a left outer join of this and other.
For each element [k, v] in this, the resulting RDD will either contain all pairs [k, [v, w]] for w in other, or the pair [k, [v, null]] if no elements in other have key k.
Hash-partitions the resulting RDD into the given number of partitions.
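A local sketch of the left-outer behavior (hypothetical helper): every left element is emitted at least once, with null standing in for a missing right-side match.

```typescript
// Local sketch of leftOuterJoin semantics: unmatched left keys
// pair with null on the right side.
function leftOuterJoinLocal<K, V, V1>(
  left: [K, V][],
  right: [K, V1][],
): [K, [V, V1 | null]][] {
  const out: [K, [V, V1 | null]][] = [];
  for (const [k, v] of left) {
    const matches = right.filter(([k1]) => k1 === k);
    if (matches.length === 0) out.push([k, [v, null]]);
    else for (const [, w] of matches) out.push([k, [v, w]]);
  }
  return out;
}

const leftJoined = leftOuterJoinLocal([["a", 1], ["b", 2]], [["a", "x"]]);
// [["a", [1, "x"]], ["b", [2, null]]]
```

rightOuterJoin and fullOuterJoin below follow the same pattern with the null padding applied to the other side, or to both.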
rightOuterJoin<K, V, V1>(this: RDD<[K, V]>, other: RDD<[K, V1]>, numPartitions?: number): RDD<[K, [V, V1]]>
Perform a right outer join of this and other.
For each element [k, w] in other, the resulting RDD will either contain all pairs [k, [v, w]] for v in this, or the pair [k, [null, w]] if no elements in this have key k.
Hash-partitions the resulting RDD into the given number of partitions.
fullOuterJoin<K, V, V1>(this: RDD<[K, V]>, other: RDD<[K, V1]>, numPartitions?: number): RDD<[K, [V, V1]]>
Perform a full outer join of this and other.
For each element [k, v] in this, the resulting RDD will either contain all pairs [k, [v, w]] for w in other, or the pair [k, [v, null]] if no elements in other have key k.
Similarly, for each element [k, w] in other, the resulting RDD will either contain all pairs [k, [v, w]] for v in this, or the pair [k, [null, w]] if no elements in this have key k.
Hash-partitions the resulting RDD into the given number of partitions.
sort(ascending?: boolean, numPartitions?: number): RDD<T>
Sorts this RDD.
sortBy<K>(keyFunc: (data: T) => K, ascending?: boolean, numPartitions?: number, env?: FunctionEnv): RDD<T>
Sorts this RDD by the given keyFunc.
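A local sketch of the sortBy contract (hypothetical helper, simplified to numeric keys): elements are ordered by the key extracted with keyFunc, ascending by default.

```typescript
// Local sketch of sortBy semantics with numeric keys.
function sortByLocal<T>(
  elements: T[],
  keyFunc: (data: T) => number,
  ascending: boolean = true,
): T[] {
  const sorted = [...elements].sort((a, b) => keyFunc(a) - keyFunc(b));
  return ascending ? sorted : sorted.reverse();
}

const byLength = sortByLocal(["ccc", "a", "bb"], s => s.length);
// ["a", "bb", "ccc"]
```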
CachedRDD<T> extends RDD<T>
unpersist(): Promise<void>
Remove this cached RDD from memory and disk.