Data Serializing
Data in RDD may be serialized in serveral cases:
- To persist data on disk or off-heap memory. 
- To transfer data for repartition 
- To transfer result to master & client 
- To get a hash value for combineByKey/distinct 
Serialize will comes with two method:
V8.serialize()/V8.deserialize()
For case 1-3, data will be serialized and deserialized later. V8.serialize() is the best function for this. It can deal many type of data:
- primitive types: undefined, null, number, boolean, string, Symbol 
- array, object literal 
- Date, RegExp, Set, Map 
- Buffer/typed arrays 
So in generally speaking you could use any type except custom classes.
Key Serialize
But there's a problem for V8.serialize: sometimes it gives different result for same input, especially for splitted string. So when we need another serialize function for combineByKey/distinct.
The function may change in future, but these types are guaranteed safe as key types:
- primitive types except Symbol: null, number, boolean, string 
- array, object literal 
Current version of dcf use a custom version of msgpack5 for key serialize, which support undefined value.
Last updated