Can't connect to Spark cluster in EC2 using pyspark
I've followed the instructions on the website of Spark and I got 1 master and 1 slave running in my Amazon. However, I'm not able to connect to the master node using pyspark
I can connect to the master node using SSH without any problem.
Here's my command spark-ec2 --key-pair=graph-cluster --identity-file=/Users/.ssh/pem.pem --region=us-east-1 --zone=us-east-1a launch graph-cluster
I can go to http://ec2-54-152-xx-xxx.compute-1.amazonaws.com:8080/
and see that Spark is up and running I also see this Spark Master at
spark://ec2-54-152-xx-xxx.compute-1.amazonaws.com:7077
However when I run command
MASTER=spark://ec2-54-152-xx-xx.compute-1.amazonaws.com:7077 pyspark
I get this error
2015-09-16 15:39:31,800 ERROR actor.OneForOneStrategy (Slf4jLogger.scala:apply$mcV$sp(66)) -
java.lang.NullPointerException
at org.apache.spark.deploy.client.AppClient$ClientActor$$anonfun$receiveWithLogging$1.applyOrElse(AppClient.scala:160)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at org.apache.spark.deploy.client.AppClient$ClientActor.aroundReceive(AppClient.scala:61)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2015-09-16 15:39:31,804 INFO client.AppClient$ClientActor (Logging.scala:logInfo(59)) - Connecting to master akka.tcp://sparkMaster@ec2-54-152-xx-xxx.compute-1.amazonaws.com:7077/user/Master...
2015-09-16 15:39:31,955 INFO util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 52333.
2015-09-16 15:39:31,956 INFO netty.NettyBlockTransferService (Logging.scala:logInfo(59)) - Server created on 52333
2015-09-16 15:39:31,959 INFO storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Trying to register BlockManager
2015-09-16 15:39:31,964 INFO storage.BlockManagerMasterEndpoint (Logging.scala:logInfo(59)) - Registering block manager xxx:52333 with 265.1 MB RAM, BlockManagerId(driver, xxx, 52333)
2015-09-16 15:39:31,969 INFO storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Registered BlockManager
2015-09-16 15:39:32,458 ERROR spark.SparkContext (Logging.scala:logError(96)) - Error initializing SparkContext.
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1503)
at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2007)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
2015-09-16 15:39:32,460 INFO spark.SparkContext (Logging.scala:logInfo(59)) - SparkContext already stopped.
Traceback (most recent call last):
File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/shell.py", line 43, in <module>
sc = SparkContext(appName="PySparkShell", pyFiles=add_files)
File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 113, in __init__
conf, jsc, profiler_cls)
File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 165, in _do_init
self._jsc = jsc or self._initialize_context(self._conf._jconf)
File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/pyspark/context.py", line 219, in _initialize_context
return self._jvm.JavaSparkContext(jconf)
File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 701, in __call__
File "/usr/local/Cellar/apache-spark/1.4.1/libexec/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext
at org.apache.spark.SparkContext.org$apache$spark$SparkContext$$assertNotStopped(SparkContext.scala:103)
at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1503)
at org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2007)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:543)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:214)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
Spark_ec2 doesn't not open port 7077 on master node for incoming connections from outside the cluster.
You can check in AWS console/EC2/Network & Security/Security Groups and check graph-cluster-master security group's Inbound tab.
You can add the rule to open inbound connection to port 7077.
But it is suggested to run pyspark (essentially Spark's App driver) from the master machine in EC2 cluster, and avoid running driver outside the network. The reason for this - increased delays and problems with settings firewall connections - you'll need to open some ports so executions could connection to driver on your machine.
So the way to go is to login to ssh cluster with this command:
spark-ec2 --key-pair=graph-cluster --identity-file=/Users/.ssh/pem.pem --region=us-east-1 --zone=us-east-1a login graph-cluster
And run the commands from the master server:
cd spark
bin/pyspark
You'll need to transfer related files (your script and data) to master. I usually keep data on S3 and edit script files with vim or start ipython notebook.
BTW the latter is very easy - you need to add the rule for incoming connections from your computer IP to port 18888 in EC2 console master's security group. And then run the command on a cluster:
IPYTHON_OPTS="notebook --pylab inline --port=18888 --ip='*'" pyspark
Then you can access it with http://ec2-54-152-xx-xxx.compute-1.amazonaws.com:18888/
链接地址: http://www.djcxy.com/p/53432.html上一篇: R中的并行回归(可能与降雪有关)