本文共 9012 字,大约阅读时间需要 30 分钟。
作者:周志湖
网名:摇摆少年梦 微信号:zhouzhihubeyond(1)union
union将两个RDD数据集元素合并,类似两个集合的并集 union函数参数:/** * Return the union of this RDD and another one. Any identical elements will appear multiple * times (use `.distinct()` to eliminate them). */ def union(other: RDD[T]): RDD[T]
RDD与另外一个RDD进行Union操作之后,两个数据集中的存在的重复元素
代码如下:scala> val rdd1=sc.parallelize(1 to 5)rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at:21scala> val rdd2=sc.parallelize(4 to 8)rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at parallelize at :21//存在重复元素scala> rdd1.union(rdd2).collectres13: Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6, 7, 8)
(2)intersection
方法返回两个RDD数据集的交集 函数参数: /** * Return the intersection of this RDD and another one. The output will not contain any duplicate * elements, even if the input RDDs did. * * Note that this method performs a shuffle internally. */ def intersection(other: RDD[T]): RDD[T]使用示例:
scala> rdd1.intersection(rdd2).collectres14: Array[Int] = Array(4, 5)
(3)distinct
distinct函数将去除重复元素 distinct函数参数:/**
* Return a new RDD containing the distinct elements in this RDD. */ def distinct(): RDD[T]scala> val rdd1=sc.parallelize(1 to 5)rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at:21scala> val rdd2=sc.parallelize(4 to 8)rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at :21scala> rdd1.union(rdd2).distinct.collectres0: Array[Int] = Array(6, 1, 7, 8, 2, 3, 4, 5)
(4)groupByKey([numTasks])
输入数据为(K, V) 对, 返回的是 (K, Iterable) ,numTasks指定task数量,该参数是可选的,下面给出的是无参数的groupByKey方法 /** * Group the values for each key in the RDD into a single sequence. Hash-partitions the * resulting RDD with the existing partitioner/parallelism level. The ordering of elements * within each group is not guaranteed, and may even differ each time the resulting RDD is * evaluated. * * Note: This operation may be very expensive. If you are grouping in order to perform an * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]] * or [[PairRDDFunctions.reduceByKey]] will provide much better performance. */ def groupByKey(): RDD[(K, Iterable[V])]scala> val rdd1=sc.parallelize(1 to 5)rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at:21scala> val rdd2=sc.parallelize(4 to 8)rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at :21scala> rdd1.union(rdd2).map((_,1)).groupByKey.collectres2: Array[(Int, Iterable[Int])] = Array((6,CompactBuffer(1)), (1,CompactBuffer(1)), (7,CompactBuffer(1)), (8,CompactBuffer(1)), (2,CompactBuffer(1)), (3,CompactBuffer(1)), (4,CompactBuffer(1, 1)), (5,CompactBuffer(1, 1)))
(5)reduceByKey(func, [numTasks])
reduceByKey函数输入数据为(K, V)对,返回的数据集结果也是(K,V)对,只不过V为经过聚合操作后的值 /** * Merge the values for each key using an associative reduce function. This will also perform * the merging locally on each mapper before sending results to a reducer, similarly to a * “combiner” in MapReduce. Output will be hash-partitioned with numPartitions partitions. */ def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)]使用示例:
scala> rdd1.union(rdd2).map((_,1)).reduceByKey(_+_).collectres4: Array[(Int, Int)] = Array((6,1), (1,1), (7,1), (8,1), (2,1), (3,1), (4,2), (5,2))
(6)sortByKey([ascending], [numTasks])
对输入的数据集按key排序 sortByKey方法定义/** * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling * `collect` or `save` on the resulting RDD will return or output an ordered list of records * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in * order of the keys). */ // TODO: this currently doesn't work on P other than Tuple2! def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length) : RDD[(K, V)]
使用示例:
scala> var data = sc.parallelize(List((1,3),(1,2),(1, 4),(2,3),(7,9),(2,4)))data: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[20] at parallelize at:21scala> data.sortByKey(true).collectres10: Array[(Int, Int)] = Array((1,3), (1,2), (1,4), (2,3), (2,4), (7,9))
(7)join(otherDataset, [numTasks])
对于数据集类型为 (K, V) 及 (K, W)的RDD,join操作后返回类型为 (K, (V, W)),join函数有三种: def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] def leftOuterJoin[W]( other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))] def rightOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner) : RDD[(K, (Option[V], W))]使用示例:
scala> val rdd1=sc.parallelize(Array((1,2),(1,3)) | )rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[24] at parallelize at:21scala> val rdd2=sc.parallelize(Array((1,3)))rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[32] at parallelize at :21scala> rdd1.join(rdd2).collectres13: Array[(Int, (Int, Int))] = Array((1,(3,3)), (1,(2,3)))
scala> rdd1.leftOuterJoin(rdd2).collectres15: Array[(Int, (Int, Option[Int]))] = Array((1,(3,Some(3))), (1,(2,Some(3))))
scala> rdd1.rightOuterJoin(rdd2).collectres16: Array[(Int, (Option[Int], Int))] = Array((1,(Some(3),3)), (1,(Some(2),3)))
(8)cogroup(otherDataset, [numTasks])
如果输入的RDD类型为(K, V) 和(K, W),则返回的RDD类型为 (K, (Iterable, Iterable)) . 该操作与 groupWith等同方法定义:
/** * For each key k inthis
or other
, return a resulting RDD that contains a tuple with the * list of values for that key in this
as well as other
. */ def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner) : RDD[(K, (Iterable[V], Iterable[W]))] scala> val rdd1=sc.parallelize(Array((1,2),(1,3)) | )rdd1: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[24] at parallelize at:21scala> val rdd2=sc.parallelize(Array((1,3)))rdd2: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[32] at parallelize at :21scala> rdd1.cogroup(rdd2).collectres17: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(3, 2),CompactBuffer(3))))scala> rdd1.groupWith(rdd2).collectres18: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2, 3),CompactBuffer(3))))
(9)cartesian(otherDataset)
求两个RDD数据集间的笛卡尔积 函数定义: /** * Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of * elements (a, b) where a is inthis
and b is in other
. */ def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] scala> val rdd1=sc.parallelize(Array(1,2,3,4))rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[52] at parallelize at:21scala> val rdd2=sc.parallelize(Array(5,6))rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[53] at parallelize at :21scala> rdd1.cartesian(rdd2).collectres21: Array[(Int, Int)] = Array((1,5), (1,6), (2,5), (2,6), (3,5), (4,5), (3,6), (4,6))
(10)coalesce(numPartitions)
将RDD的分区数减至指定的numPartitions分区数函数定义:
/** * Return a new RDD that is reduced intonumPartitions
partitions. * * This results in a narrow dependency, e.g. if you go from 1000 partitions * to 100 partitions, there will not be a shuffle, instead each of the 100 * new partitions will claim 10 of the current partitions. * * However, if you’re doing a drastic coalesce, e.g. to numPartitions = 1, * this may result in your computation taking place on fewer nodes than * you like (e.g. one node in the case of numPartitions = 1). To avoid this, * you can pass shuffle = true. This will add a shuffle step, but means the * current upstream partitions will be executed in parallel (per whatever * the current partitioning is). * * Note: With shuffle = true, you can actually coalesce to a larger number * of partitions. This is useful if you have a small number of partitions, * say 100, potentially with a few partitions being abnormally large. Calling * coalesce(1000, shuffle = true) will result in 1000 partitions with the * data distributed using a hash partitioner. */ def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null) : RDD[T] 示例代码:
scala> val rdd1=sc.parallelize(1 to 100,3)rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[55] at parallelize at:21scala> val rdd2=rdd1.coalesce(2)rdd2: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[56] at coalesce at :23
repartition(numPartitions),功能与coalesce函数相同,实质上它调用的就是coalesce函数,只不是shuffle = true,意味着可能会导致大量的网络开销。
方法定义: /** * Return a new RDD that has exactly numPartitions partitions. * * Can increase or decrease the level of parallelism in this RDD. Internally, this uses * a shuffle to redistribute data. * * If you are decreasing the number of partitions in this RDD, consider usingcoalesce
, * which can avoid performing a shuffle. */ def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope { coalesce(numPartitions, shuffle = true) }