Can't connect to Spark cluster in EC2 using pyspark

I followed the instructions on Spark's website, and I have a master and a slave running on Amazon EC2. However, I can't connect to the master node using pyspark.

I can connect to the master node over SSH without any problems.

Here is the command I used to launch the cluster:

spark-ec2 --key-pair=graph-cluster --identity-file=/Users/.ssh/pem.pem --region=us-east-1 --zone=us-east-1a launch graph-cluster

I can go to http://ec2-54-152-xx-xxx.compute-1.amazonaws.com:8080/

and see that Spark is running; I also see the Spark master at

spark://ec2-54-152-xx-xxx.compute-1.amazonaws.com:7077

But when I run the command

MASTER=spark://ec2-54-152-xx-xx.compute-1.amazonaws.com:7077 pyspark

I get this error:

2015-09-16 15:39:31,800 ERROR actor.OneForOneStrategy (Slf4jLogger.scala:apply$mcV$sp(66)) -
java.lang.NullPointerException
    at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2015-09-16 15:39:31,804 INFO  client.AppClient$ClientActor (Logging.scala:logInfo(59)) - Connecting to master akka.tcp://sparkMaster@ec2-54-152-xx-xxx.compute-1.amazonaws.com:7077/user/Master...
2015-09-16 15:39:31,955 INFO  util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52333.
2015-09-16 15:39:31,956 INFO  netty.NettyBlockTransferService (Logging.scala:logInfo(59)) - Server created on 52333
2015-09-16 15:39:31,959 INFO  storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Trying to register BlockManager
2015-09-16 15:39:31,964 INFO  storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(59)) - Registering block manager xxx:52333 with 265.1 MB RAM, BlockManagerId(driver, xxx, 52333)
2015-09-16 15:39:31,969 INFO  storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Registered BlockManager
2015-09-16 15:39:32,458 ERROR spark.SparkContext (Logging.scala:logError(96)) - Error initializing SparkContext.
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
    at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
    at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1503)
    at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2007)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
2015-09-16 15:39:32,460 INFO  spark.SparkContext (Logging.scala:logInfo(59)) - SparkContext already stopped.
Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 113, in __init__
    conf, jsc, profiler_cls)
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 165, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 219, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
    at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
    at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1503)
    at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2007)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

spark-ec2 does not open port 7077 on the master node to incoming connections from outside the cluster.

You can check this in the AWS Console under EC2 / Network & Security / Security Groups, in the Inbound tab of the graph-cluster-master security group.

You can add a rule there to open inbound connections to port 7077.
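
If you prefer the command line, the same rule can be added with the AWS CLI. A minimal sketch (the group name follows spark-ec2's usual <cluster-name>-master convention, and the source IP is a placeholder you should replace with your own):

aws ec2 authorize-security-group-ingress --group-name graph-cluster-master --protocol tcp --port 7077 --cidr 203.0.113.10/32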

However, the recommended approach is to run pyspark (which is essentially Spark's application driver) from a machine inside the EC2 cluster, and to avoid running the driver outside the network. The reasons are increased latency and firewall configuration problems: you would need to open several ports so the executors can connect back to the driver on your machine.
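
To give a sense of what that involves: by default the driver and its block manager listen on random ports, so running the driver outside the cluster typically means pinning them to fixed ports (there are a few more *.port settings in the same vein) and opening those in the security group. A sketch, with arbitrary port numbers as assumptions:

MASTER=spark://ec2-54-152-xx-xxx.compute-1.amazonaws.com:7077 pyspark --conf spark.driver.port=51000 --conf spark.blockManager.port=51001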

So the way to go is to log in to the cluster over SSH with this command:

spark-ec2 --key-pair=graph-cluster --identity-file=/Users/.ssh/pem.pem --region=us-east-1 --zone=us-east-1a login graph-cluster

and then run the commands from the master:

cd spark
bin/pyspark
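
spark-ec2 normally writes the cluster URL into the deployed Spark's configuration (conf/spark-env.sh), so bin/pyspark should attach to the standalone master automatically; if it comes up in local mode instead, you can pass the URL explicitly:

MASTER=spark://ec2-54-152-xx-xxx.compute-1.amazonaws.com:7077 bin/pyspark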

You will need to transfer the relevant files (your scripts and data) to the master. I usually keep data on S3 and edit script files with vim, or start an IPython notebook.
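
For example, a script can be copied up with scp (a sketch: my_script.py is a placeholder, and spark-ec2 instances are set up for root login):

scp -i /Users/.ssh/pem.pem my_script.py root@ec2-54-152-xx-xxx.compute-1.amazonaws.com:~/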

By the way, the latter is quite easy: add a rule allowing inbound connections from your machine's IP to port 18888 to the master's security group in the EC2 console, then run this command on the cluster:

IPYTHON_OPTS="notebook --pylab inline --port=18888 --ip='*'" pyspark

Then you can access it at http://ec2-54-152-xx-xxx.compute-1.amazonaws.com:18888/
