Low cache hit ratio when running Spark on Alluxio

Jerry Yan
Hi,

We are running Spark jobs on an Alluxio cluster that serves 13 GB of data, 99% of which is in memory. I was hoping to speed up the Spark jobs by reading the in-memory data from Alluxio, but found that the Alluxio local hit rate is only 1.68%, while the remote hit rate is 98.32%. Monitoring network I/O across all worker nodes with the "dstat" command, I found that only two nodes had about 1 GB of receive or send traffic over the whole run, and that 1 GB was sent or received during the Spark shuffle stage. Are there any metrics I could check or configurations to tune?
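For reference, a minimal sketch of the kind of dstat invocation used for the monitoring above (the 5-second interval is an arbitrary choice):

    dstat -n 5    # print network receive/send throughput every 5 seconds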


Best,

Jerry

Re: Low cache hit ratio when running Spark on Alluxio

Bin Fan
Depending on the Alluxio version you are running, e.g., for 2.0, metrics for local short-circuit reads are not collected by default.
So I would suggest first turning on collection of local short-circuit read metrics by setting
alluxio.user.metrics.collection.enabled=true
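This is an Alluxio client property, so it needs to reach the Spark driver and executor JVMs. A minimal sketch of passing it through spark-submit (a common pattern for Alluxio client options; adjust to however you already configure the Alluxio client, and replace the trailing "..." with the rest of your submit command):

    spark-submit \
      --conf 'spark.driver.extraJavaOptions=-Dalluxio.user.metrics.collection.enabled=true' \
      --conf 'spark.executor.extraJavaOptions=-Dalluxio.user.metrics.collection.enabled=true' \
      ...

Once enabled, the short-circuit read counters should show up alongside the other client metrics (e.g., on the master web UI's metrics page).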

Regarding the general question of how to achieve high data locality when running Spark on Alluxio, please read the Alluxio documentation on data locality and follow the suggestions there. E.g., things can get tricky when running Spark on YARN in this case.
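One quick check from the Spark side: locality is only recognized when Spark executor hostnames match Alluxio worker hostnames, and the task locality level in the Spark UI reveals whether local reads are actually happening. A minimal sketch for spark-shell (the master host and path are placeholders):

    // Read a file from Alluxio and run a job over it; in the Spark UI's
    // stage view, tasks served by a co-located Alluxio worker should show
    // locality level NODE_LOCAL rather than ANY.
    val rdd = sc.textFile("alluxio://<master-host>:19998/path/to/data")
    rdd.count()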

If you need more detailed instructions, feel free to join the Alluxio community channel: https://slackin.alluxio.io

- Bin Fan
