Can't connect to Spark cluster in EC2 using pyspark

I've followed the instructions on the Spark website and now have 1 master and 1 slave running in Amazon EC2. However, I'm not able to connect to the master node using pyspark.

I can connect to the master node using SSH without any problem.

Here's my launch command:

spark-ec2 --key-pair=graph-cluster --identity-file=/Users/.ssh/pem.pem --region=us-east-1 --zone=us-east-1a launch graph-cluster

I can go to http://ec2-54-152-xx-xxx.compute-1.amazonaws.com:8080/

and see that Spark is up and running. The page also shows the Spark master URL:

spark://ec2-54-152-xx-xxx.compute-1.amazonaws.com:7077

However, when I run the command

MASTER=spark://ec2-54-152-xx-xx.compute-1.amazonaws.com:7077 pyspark

I get this error:

2015-09-16 15:39:31,800 ERROR actor.OneForOneStrategy (Slf4jLogger.scala:apply$mcV$sp(66)) -
java.lang.NullPointerException
    at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
    at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
    at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
    at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2015-09-16 15:39:31,804 INFO  client.AppClient$ClientActor (Logging.scala:logInfo(59)) - Connecting to master akka.tcp://sparkMaster@ec2-54-152-xx-xxx.compute-1.amazonaws.com:7077/user/Master...
2015-09-16 15:39:31,955 INFO  util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52333.
2015-09-16 15:39:31,956 INFO  netty.NettyBlockTransferService (Logging.scala:logInfo(59)) - Server created on 52333
2015-09-16 15:39:31,959 INFO  storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Trying to register BlockManager
2015-09-16 15:39:31,964 INFO  storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(59)) - Registering block manager xxx:52333 with 265.1 MB RAM, BlockManagerId(driver, xxx, 52333)
2015-09-16 15:39:31,969 INFO  storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Registered BlockManager
2015-09-16 15:39:32,458 ERROR spark.SparkContext (Logging.scala:logError(96)) - Error initializing SparkContext.
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
    at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
    at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1503)
    at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2007)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)
2015-09-16 15:39:32,460 INFO  spark.SparkContext (Logging.scala:logInfo(59)) - SparkContext already stopped.
Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/shell.py", line 43, in <module>
    sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 113, in __init__
    conf, jsc, profiler_cls)
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 165, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 219, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__
  File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
    at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
    at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1503)
    at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2007)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:214)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
    at py4j.GatewayConnection.run(GatewayConnection.java:207)
    at java.lang.Thread.run(Thread.java:745)

spark-ec2 does not open port 7077 on the master node for incoming connections from outside the cluster.

You can verify this in the AWS console under EC2 / Network & Security / Security Groups, in the graph-cluster-master security group's Inbound tab.

You can add a rule there to open inbound connections to port 7077.
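
If you prefer the command line, a rule like this can be added with the AWS CLI (a rough sketch: graph-cluster-master is the group name shown in the console above, and the CIDR is a placeholder you would replace with your own IP):

aws ec2 authorize-security-group-ingress --region us-east-1 --group-name graph-cluster-master --protocol tcp --port 7077 --cidr 203.0.113.10/32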

However, it is better to run pyspark (which is essentially Spark's application driver) from the master machine in the EC2 cluster and to avoid running the driver outside the network. The reasons are increased latency and firewall configuration problems: you would need to open additional ports so that the executors can connect back to the driver on your machine.

So the way to go is to log in to the cluster over SSH with this command:

spark-ec2 --key-pair=graph-cluster --identity-file=/Users/.ssh/pem.pem --region=us-east-1 --zone=us-east-1a login graph-cluster

And run these commands on the master server:

cd spark
bin/pyspark
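
Once the shell starts on the master, a quick sanity check such as the one below (just an illustrative snippet, not part of the original setup) confirms the SparkContext is actually talking to the cluster:

# sc is the SparkContext that bin/pyspark creates for you
sc.parallelize(range(1000)).count()   # should return 1000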

You'll need to transfer related files (your script and data) to the master. I usually keep data on S3 and edit script files with vim, or start an IPython notebook.
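
For example, reading data straight from S3 inside the pyspark shell looks roughly like this (the bucket and path are placeholders; the s3n:// scheme was the usual choice on Spark 1.4-era EC2 clusters, and this assumes AWS credentials are configured on the cluster):

# hypothetical bucket/key, shown only to illustrate the S3 workflow
lines = sc.textFile("s3n://your-bucket/path/to/data.txt")
lines.count()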

By the way, the latter is very easy: add a rule for incoming connections from your computer's IP to port 18888 in the master's security group in the EC2 console, and then run this command on the cluster:

IPYTHON_OPTS="notebook --pylab inline --port=18888 --ip='*'" pyspark

Then you can access it at http://ec2-54-152-xx-xxx.compute-1.amazonaws.com:18888/
