Difference between rdd.collect().toMap and rdd.collectAsMap()?
Is there any performance impact when I use collectAsMap on my RDD instead of rdd.collect().toMap?
I have a key-value RDD that I want to convert to a HashMap. As far as I know, collect() is not efficient on large data sets because it brings all the data to the driver. Can I use collectAsMap instead, and is there any performance impact?
Original:
val QuoteHashMap=QuoteRDD.collect().toMap
val QuoteRDDData=QuoteHashMap.values.toSeq
val QuoteRDDSet=sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(","").replace(")","")))
QuoteRDDSet.saveAsTextFile(Quotepath)
Change:
val QuoteHashMap=QuoteRDD.collectAsMap()
val QuoteRDDData=QuoteHashMap.values.toSeq
val QuoteRDDSet=sc.parallelize(QuoteRDDData.map(x => x.toString.replace("(","").replace(")","")))
QuoteRDDSet.saveAsTextFile(Quotepath)
The implementation of collectAsMap is the following:
def collectAsMap(): Map[K, V] = self.withScope {
  val data = self.collect()
  val map = new mutable.HashMap[K, V]
  map.sizeHint(data.length)
  data.foreach { pair => map.put(pair._1, pair._2) }
  map
}
Thus, there is no performance difference between collect and collectAsMap, because collectAsMap also calls collect under the hood.
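To make the comparison concrete, here is a minimal sketch in plain Scala, with no Spark involved, of what each variant does once collect() has returned its array on the driver. The object and method names (CollectComparison, viaToMap, viaCollectAsMap) are made up for illustration; the body of viaCollectAsMap mirrors the implementation quoted above.

```scala
import scala.collection.mutable

object CollectComparison {
  // What rdd.collect().toMap does once the Array[(K, V)] is on the driver.
  def viaToMap[K, V](collected: Array[(K, V)]): Map[K, V] =
    collected.toMap

  // What collectAsMap does with the same array, following the quoted source.
  def viaCollectAsMap[K, V](collected: Array[(K, V)]): scala.collection.Map[K, V] = {
    val map = new mutable.HashMap[K, V]
    map.sizeHint(collected.length)
    collected.foreach { case (k, v) => map.put(k, v) }
    map
  }

  def main(args: Array[String]): Unit = {
    // Stand-in for the result of rdd.collect(); note the duplicate key "a".
    val collected = Array("a" -> 1, "b" -> 2, "a" -> 3)
    val m1 = viaToMap(collected)
    val m2 = viaCollectAsMap(collected)
    // Both paths hold the same entries; a later pair overwrites an earlier one.
    assert(m1 == m2)
    assert(m1("a") == 3)
    println(s"same entries: ${m1 == m2}")
  }
}
```

Apart from the return type (an immutable Map from toMap, a scala.collection.Map backed by a mutable HashMap from collectAsMap), the two end up with the same entries, and in both cases the expensive step is the collect() itself.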
No difference. Avoid using collect() as much as you can: it pulls the whole data set onto the driver, which defeats the purpose of processing it in parallel.
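For the pipeline in the question, one way to follow this advice might be to keep everything as RDD transformations and never build the map on the driver. This is only a sketch: saveQuoteValues is a hypothetical helper, and it assumes QuoteRDD is an RDD of key-value pairs whose values are formatted with toString as in the original code. The reduceByKey step stands in for the "one value per key" effect of building a map; which duplicate survives is not guaranteed to match the collect-based version.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of the original question or of Spark itself.
def saveQuoteValues[K: ClassTag, V: ClassTag](quoteRDD: RDD[(K, V)], quotePath: String): Unit = {
  quoteRDD
    .reduceByKey((_, later) => later) // keep a single value per key, as a map would
    .values                           // drop the keys, like QuoteHashMap.values
    .map(_.toString.replace("(", "").replace(")", ""))
    .saveAsTextFile(quotePath)        // written by the executors, not the driver
}
```

If the keys in QuoteRDD are already unique, the reduceByKey step can be dropped, which also avoids a shuffle.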