Context

emptyRDD(): RDD<never>

Create an RDD that has no partitions or elements.
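
For example, collecting an empty RDD yields an empty array:

> dcc.emptyRDD().collect()
[]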

range(from: number, to?: number, step?: number = 1, numPartitions?: number): RDD<number>

Create a new RDD of numbers containing every value between from (inclusive) and to (exclusive), stepping by step. If called with a single argument, that argument is interpreted as to, and from is set to 0.

Parameters:

  • from – the start value

  • to – the end value (exclusive)

  • step – the increment between consecutive elements (default: 1)

  • numPartitions – the number of partitions of the new RDD

Returns:

An RDD of numbers

> dcc.range(5).collect()
[ 0, 1, 2, 3, 4 ]
> dcc.range(2, 4).collect()
[ 2, 3 ]
> dcc.range(1, 7, 2).collect()
[ 1, 3, 5 ]
> dcc.range(1, 7, 1.5).collect()
[ 1, 2.5, 4, 5.5 ]

Difference from Spark:

The step parameter can be a non-integer number, but this may cause floating-point precision loss and lead to unexpected results:

> dcc.range(1, 1.3, 0.1).collect()
[ 1, 1.1, 1.2, 1.3 ]
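
If exact endpoints matter, one workaround (a sketch that uses only methods shown on this page) is to generate integer indices and scale them inside map(), so no floating-point error accumulates across steps:

> dcc.range(0, 3).map(i => 1 + i * 0.1).collect()
[ 1, 1.1, 1.2 ]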

parallelize<T>(c: T[], numPartitions?: number): RDD<T>

Distribute a local JavaScript array to form an RDD.
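
For example (a minimal illustration; the second argument sets the partition count):

> dcc.parallelize([1, 2, 3, 4, 5], 2).collect()
[ 1, 2, 3, 4, 5 ]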

union<T>(...rdds: RDD<T>[]): RDD<T>

Build the union of a list of RDDs.

> dcc.union(dcc.parallelize(['Hello']), dcc.parallelize(['World'])).collect()
[ 'Hello', 'World' ]

binaryFiles(baseUrl: string, options?: { recursive?: boolean }): RDD<[string, Buffer]>

Read a directory of binary files from any file system (must be available on all nodes). Each file is read as a single record and returned as a key-value pair, where the key is the path of the file (relative to baseUrl) and the value is its content.

Small files are preferred, as each file is loaded fully into memory.

> dcc.binaryFiles('./src', { recursive: true }).map(v => [v[0], v[1].length]).collect()
[ [ 'cli\\index.ts', 1220 ],
  [ 'client\\Client.ts', 280 ],
  ... ]

Difference from Spark:

  • Each file becomes a single partition. If needed, you can call repartition() after the files are loaded (see the sketch after this list).

  • DCF supports recursive loading, which loads all files in the directory and its subdirectories.
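
For example, to spread many small files across a fixed number of partitions (a sketch; it assumes repartition() takes the target partition count, as in Spark):

> dcc.binaryFiles('./src', { recursive: true }).repartition(4).map(v => v[0]).collect()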

wholeTextFiles(baseUrl: string, options?: Options): RDD<[string, string]>

Read a directory of text files from any file system (must be available on all nodes). Each file is read as a single record and returned as a key-value pair, where the key is the path of the file and the value is its content.

Options

  • encoding?: string – encoding of the text file(s) (default: 'utf8')

  • recursive?: boolean – include files in subdirectories

  • decompressor?: (data: Buffer, filename: string) => Buffer – a function that decompresses each file's content before decoding

  • functionEnv?: FunctionEnv – serialization context for the function (its upvalues)

Small files are preferred, as each file is loaded fully into memory.
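
For example, to list each file with its character count (a sketch following the same pattern as the binaryFiles example above):

> dcc.wholeTextFiles('./src', { recursive: true }).map(v => [v[0], v[1].length]).collect()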

Difference from Spark:

  • Each file becomes a single partition. If needed, you can call repartition() after the files are loaded.

  • DCF supports recursive loading, which loads all files in the directory and its subdirectories.

  • The decompressor is provided as a function instead of a configuration option (see the sketch after this list).
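
For example, a decompressor that gunzips .gz files and passes everything else through unchanged (a minimal sketch; './logs' is a placeholder directory, and it assumes Node's zlib module is available where the function runs, calling require() inside the function body so no upvalues need to be serialized):

> dcc.wholeTextFiles('./logs', {
    recursive: true,
    decompressor: (data, filename) =>
      filename.endsWith('.gz') ? require('zlib').gunzipSync(data) : data,
  }).collect()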

textFile(baseUrl: string, options?: Options): RDD<string>

Read a single text file or a directory of text files from any file system (must be available on all nodes), and return an RDD of their lines.

Options

  • encoding?: string – encoding of the text file(s) (default: 'utf8')

  • recursive?: boolean – include files in subdirectories

  • decompressor?: (data: Buffer, filename: string) => Buffer – a function that decompresses each file's content before decoding

  • functionEnv?: FunctionEnv – serialization context for the function (its upvalues)

  • __dangerousDontCopy?: boolean – do not copy each line; keep references into the original file content string instead (see the note below)

About copying: V8 optimizes string slice/split by not copying the sliced content; a slice keeps a reference to the parent string plus offset information instead. This can cause a memory leak if you retain only a small slice (for example, via take(1)), because the whole parent string is kept alive. See https://bugs.chromium.org/p/v8/issues/detail?id=2869

So DCF copies every line to avoid this problem. If your job does not retain the sliced lines after the computation finishes, you can safely skip the copy by passing __dangerousDontCopy: true.
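
For example, a job that consumes each line immediately and retains no slices, so skipping the copy is safe ('./big.log' is a placeholder path):

> dcc.textFile('./big.log', { __dangerousDontCopy: true }).map(line => line.length).collect()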
