Data Serialization
Data in an RDD may be serialized in several cases:
1. To persist data on disk or in off-heap memory.
2. To transfer data for repartitioning.
3. To transfer results to the master and client.
4. To get a hash value for combineByKey/distinct.
Serialization comes in two forms:
V8.serialize()/V8.deserialize()
For cases 1-3, data is serialized and then deserialized later. V8.serialize() is the best function for this. It can handle many types of data:
primitive types: undefined, null, number, boolean, string, Symbol
array, object literal
Date, RegExp, Set, Map
Buffer/typed arrays
Generally speaking, you can use any type except custom classes.
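The round trip described above can be sketched with the `serialize`/`deserialize` functions of the built-in Node.js `v8` module. This is a minimal illustration, not dcf's internal code: the `record` object and the `Point` class are hypothetical names made up for the example.

```javascript
const v8 = require('v8');

// Hypothetical sample record mixing several of the types listed above.
const record = {
  id: 42,
  name: 'alice',
  tags: ['a', 'b'],
  created: new Date(0),
  pattern: /ab+c/gi,
  seen: new Set([1, 2, 3]),
  scores: new Map([['x', 1.5]]),
  raw: Buffer.from('hello'),
  missing: undefined,
  nothing: null,
};

// serialize() returns a Buffer; deserialize() rebuilds the value from it.
const bytes = v8.serialize(record);
const restored = v8.deserialize(bytes);

// Hypothetical custom class, to show why custom classes are excluded:
// the own properties survive, but the prototype (and its methods) do not.
class Point {
  constructor(x, y) {
    this.x = x;
    this.y = y;
  }
  norm() {
    return Math.hypot(this.x, this.y);
  }
}

const restoredPoint = v8.deserialize(v8.serialize(new Point(3, 4)));
console.log(restoredPoint instanceof Point); // false
console.log(typeof restoredPoint.norm);      // 'undefined'
```

If a custom class is needed inside a task, one option is to ship plain data and rebuild the instance after deserialization.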
Key Serialization
However, V8.serialize() has a problem: it sometimes gives different results for the same input, especially for strings produced by splitting. So we need another serialization function for combineByKey/distinct, where equal keys must always serialize to identical bytes.
This function may change in the future, but the following types are guaranteed to be safe as key types:
primitive types except Symbol: null, number, boolean, string
array, object literal
The current version of dcf uses a custom version of msgpack5 for key serialization, which supports undefined values.
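To illustrate the property a key serializer needs, here is a minimal sketch of a deterministic encoder for the safe key types. It is not dcf's msgpack5-based implementation: `stableKey` and `hashKey` are hypothetical helpers, and the string encoding they produce is an assumption chosen for readability. The point is only that equal keys always yield the same output, regardless of how a string was built or in which order object properties were added.

```javascript
const crypto = require('crypto');

// Hypothetical helper: encodes a key as a canonical string.
// Supports undefined, null, boolean, number, string, arrays and object literals.
function stableKey(value) {
  if (value === undefined) return 'u';
  if (value === null) return 'n';
  switch (typeof value) {
    case 'boolean':
      return value ? 't' : 'f';
    case 'number':
      return 'd' + String(value);
    case 'string':
      // JSON quoting keeps strings unambiguous inside arrays and objects.
      return 's' + JSON.stringify(value);
    case 'object': {
      if (Array.isArray(value)) {
        return '[' + value.map(stableKey).join(',') + ']';
      }
      const proto = Object.getPrototypeOf(value);
      if (proto === Object.prototype || proto === null) {
        // Sort property names so insertion order does not affect the result.
        const parts = Object.keys(value)
          .sort()
          .map((k) => JSON.stringify(k) + ':' + stableKey(value[k]));
        return '{' + parts.join(',') + '}';
      }
      break;
    }
  }
  throw new TypeError('Unsupported key type');
}

// Hypothetical helper: hashes the canonical form, e.g. to pick a partition.
function hashKey(value) {
  return crypto.createHash('sha1').update(stableKey(value)).digest('hex');
}

// Same logical key, different construction: both lines print the same hash.
console.log(hashKey({ user: 'ab'.concat('c'), n: 1 }));
console.log(hashKey({ n: 1, user: 'abc' }));
```

Types outside the safe list (Symbol, Date, custom classes, and so on) are rejected here with a TypeError rather than encoded.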