Has anybody had experience integrating C/C++ code into Spark jobs?
I have done some work on this topic using JNA. I wrote a FlatMapFunction that processes all partition entries using a C++ library. This approach works well, but there are some tradeoffs:
* Shipping the native dylib with the app jar and loading it at runtime requires a bit of work (on top of normal JNA usage)
* Native code doesn't respect the executor heap limits. Under heavy memory pressure, the native code can sometimes ENOMEM sporadically.
* While JNA can map Strings, structs, and Java primitive types, the user still needs to deal with more complex objects. E.g. re-serialize protobuf/thrift objects, or provide some other encoding for moving data between Java and C/C++.
* C++ static is not thread-safe before C++11, so the user sometimes needs to take care running inside multi-threaded executors
* Avoiding memory copies can be a little tricky
One other alternative approach comes to mind is pipe(). However, PipedRDD requires copying data over pipes, does not support binary data (?), and native code errors that crash the subprocess don't bubble up to the Spark job as nicely as with JNA.
Is there a way to expose raw, in-memory partition/block data to native code?
Has anybody else attacked this problem a different way?
More thoughts. I took a deeper look at BlockManager, RDD, and friends. Suppose one wanted to get native code access to un-deserialized blocks. This task looks very hard. An RDD behaves much like a Scala iterator of deserialized values, and interop with BlockManager is all on deserialized data. One would probably need to rewrite much of RDD, CacheManager, etc in native code; an RDD subclass (e.g. PythonRDD) probably wouldn't work.
So exposing raw blocks to native code looks intractable. I wonder how fast Java/Kyro can SerDe of byte arrays. E.g. suppose we have an RDD<T> where T is immutable and most of the memory for a single T is a byte array. What is the overhead of SerDe-ing T? (Does Java/Kyro copy the underlying memory?) If the overhead is small, then native access to raw blocks wouldn't really yield any advantage.