Using R in Apache Spark

There are several options for accessing R libraries from Spark:

  • directly using SparkR
  • using language bindings such as rpy2 or rscala
  • using a standalone service such as OpenCPU

    It looks like SparkR is quite limited, OpenCPU requires running an additional service, and the language bindings can have stability issues. Is there anything specific to Spark's architecture that makes none of these solutions easy to use?

    Do you have any experience integrating R and Spark that you can share?


    The main language for the project seems like an important factor.

    If pyspark is a good way for you to use Spark (meaning that you are accessing Spark from Python), accessing R through rpy2 should not be much different from using any other Python library with a C extension.

    There are reports of users doing so (although with occasional questions, such as "How can I partition pyspark RDDs holding R functions" or "Can I connect an external (R) process to each pyspark worker during setup").
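    For a concrete picture, here is a minimal sketch of what the pyspark + rpy2 route can look like. It assumes R and the rpy2 package are installed on every worker node; the function and variable names are purely illustrative, not part of any of the packages discussed above.

        from pyspark import SparkContext

        sc = SparkContext(appName="rpy2-sketch")

        def r_mean_per_partition(rows):
            # Import rpy2 inside the function so each worker process starts
            # its own embedded R instance.
            import rpy2.robjects as robjects
            values = list(rows)
            if not values:
                return []
            r_mean = robjects.r["mean"]                    # look up R's `mean` function
            result = r_mean(robjects.FloatVector(values))  # hand the partition's data to R
            return [float(result[0])]

        rdd = sc.parallelize([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], numSlices=2)
        # one R-computed mean per partition, e.g. [2.0, 5.0]
        print(rdd.mapPartitions(r_mean_per_partition).collect())

    The same pattern should work for RDDs of more complex records, as long as whatever you hand to R can be converted into rpy2 data structures.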

    If R is your main language, helping the SparkR authors with feedback or contributions where you feel there are limitations would be the way to go.

    If your main language is Scala, rscala should be your first try.

    While the combo pyspark + rpy2 would seem the most "established" (as in "uses the oldest and probably most-tried codebase"), this does not necessarily mean that it is the best solution (and young packages can evolve quickly). I'd first assess what the preferred language for the project is and try the options from there.
