Data Serialization

Data in an RDD may be serialized in several cases:

  1. To persist data on disk or in off-heap memory.

  2. To transfer data for repartitioning.

  3. To transfer results to the master and the client.

  4. To compute a hash value for combineByKey/distinct.

dcf serializes data with one of two methods:

V8.serialize()/V8.deserialize()

For cases 1-3, data is serialized and then deserialized later. V8.serialize() is the best function for this. It can handle many types of data:

  1. primitive types: undefined, null, number, boolean, string

  2. array, object literal

  3. Date, RegExp, Set, Map

  4. Buffer/typed arrays

Generally speaking, you can use any type except custom classes.
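The behaviour described above can be checked with a small round trip. This is a minimal sketch that assumes Node's built-in `v8` module; `roundTrip`, `record` and `Point` are hypothetical names used only for illustration and are not part of dcf.

```javascript
const v8 = require('v8');

// Hypothetical helper: serialize to a Buffer and immediately read it back.
function roundTrip(value) {
  return v8.deserialize(v8.serialize(value));
}

const record = {
  id: 42,
  missing: undefined,
  tags: new Set(['a', 'b']),
  counts: new Map([['x', 1]]),
  created: new Date(0),
  pattern: /ab+c/gi,
  bytes: new Uint8Array([1, 2, 3]),
};

const copy = roundTrip(record);
console.log(copy.tags instanceof Set);  // true: built-in types keep their type
console.log(copy.created.getTime());    // 0

// A custom class instance comes back as a plain object: its own properties
// survive, but the prototype (and therefore its methods) does not.
class Point {
  constructor(x, y) { this.x = x; this.y = y; }
  norm() { return Math.hypot(this.x, this.y); }
}
const point = roundTrip(new Point(3, 4));
console.log(point instanceof Point, typeof point.norm);  // prints: false undefined
```

If RDD elements need methods after a round trip, one option is to keep them as object literals and rebuild any class instances inside the mapping function.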

Key Serialization

V8.serialize has one problem, though: it sometimes gives different results for the same input, especially for strings produced by splitting another string. So we need another serialization function for combineByKey/distinct, where equal keys must always serialize to the same bytes.

The function may change in the future, but these types are guaranteed to be safe as key types:

  1. primitive types except Symbol: null, number, boolean, string

  2. array, object literal

The current version of dcf uses a custom version of msgpack5 for key serialization, which supports undefined values.
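To illustrate why key serialization has to be deterministic, here is a minimal sketch of a value-based encoder feeding a hash. It is not dcf's msgpack5-based implementation: `encodeKey`, `hashKey` and `distinctByKey` are hypothetical names, the text encoding is invented for this example, and object literals are encoded in insertion order as an assumption of the sketch.

```javascript
const crypto = require('crypto');

// Hypothetical stand-in for a key serializer (dcf itself uses a customized
// msgpack5). The output depends only on the value, never on how the engine
// happens to store it internally.
function encodeKey(key) {
  if (key === undefined) return 'u';
  if (key === null) return 'n';
  switch (typeof key) {
    case 'boolean': return key ? 't' : 'f';
    case 'number': return 'd' + String(key) + ';';
    case 'string': return 's' + key.length + ':' + key;
    case 'object': {
      if (Array.isArray(key)) {
        return 'a' + key.length + '[' + key.map(encodeKey).join('') + ']';
      }
      const proto = Object.getPrototypeOf(key);
      if (proto !== Object.prototype && proto !== null) {
        throw new TypeError('only arrays and object literals are supported');
      }
      // Object literal: entries are encoded in insertion order.
      return 'o{' + Object.entries(key)
        .map(([k, v]) => encodeKey(k) + encodeKey(v)).join('') + '}';
    }
    default:
      throw new TypeError('unsupported key type: ' + typeof key);
  }
}

// Hash of the encoded key, usable for bucketing in combineByKey/distinct.
function hashKey(key) {
  return crypto.createHash('sha1').update(encodeKey(key)).digest('hex');
}

// Simplified local version of distinct: keep the first value per key hash.
function distinctByKey(values) {
  const seen = new Map();
  for (const value of values) {
    const hash = hashKey(value);
    if (!seen.has(hash)) seen.set(hash, value);
  }
  return [...seen.values()];
}

// A string produced by split() hashes the same as an equal literal.
console.log(hashKey('a,b'.split(',')[0]) === hashKey('a'));  // true
```

Because the encoding is derived from the value alone, equal keys always land in the same bucket, which is the property combineByKey and distinct rely on.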
