PySpark Cookbook

.join(...) transformation

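The join(RDD') transformation matches elements by key: calling it on an RDD of (key, val_left) pairs with an RDD of (key, val_right) pairs returns an RDD of (key, (val_left, val_right)), with one element for each pair of values that share a key. Keys present in only one of the two RDDs are dropped. Outer joins are supported through the leftOuterJoin(...), rightOuterJoin(...), and fullOuterJoin(...) transformations.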

Look at the following code snippet:

# Flights data
# e.g. (u'JFK', u'01010900')
flt = flights.map(lambda c: (c[3], c[0]))

# Airports data
# e.g. (u'JFK', u'NY')
air = airports.map(lambda c: (c[3], c[1]))

# Execute inner join between RDDs
flt.join(air).take(5)

This will give you the following result:

# Output
[(u'JFK', (u'01010900', u'NY')),
(u'JFK', (u'01011200', u'NY')),
(u'JFK', (u'01011900', u'NY')),
(u'JFK', (u'01011700', u'NY')),
(u'JFK', (u'01010800', u'NY'))]
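
To make the difference between the inner join and an outer join concrete, the following is a minimal plain-Python sketch that emulates the pairing on small lists of (key, value) tuples. It does not use Spark and is not how Spark implements joins; the helper names inner_join and left_outer_join, the sample rows, and the 'XXX' airport code are made up for illustration.

```python
from collections import defaultdict


def _group_by_key(pairs):
    """Collect the values of (key, value) pairs into a dict of lists."""
    grouped = defaultdict(list)
    for key, val in pairs:
        grouped[key].append(val)
    return grouped


def inner_join(left, right):
    """Emulate the result shape of an inner join on (key, value) pairs."""
    right_by_key = _group_by_key(right)
    # One output element per (left value, right value) combination
    # that shares a key; unmatched keys are dropped.
    return [(key, (lval, rval))
            for key, lval in left
            for rval in right_by_key.get(key, [])]


def left_outer_join(left, right):
    """Emulate the result shape of a left outer join."""
    right_by_key = _group_by_key(right)
    # Left keys with no match on the right are kept and paired with None.
    return [(key, (lval, rval))
            for key, lval in left
            for rval in right_by_key.get(key, [None])]


# Hypothetical sample data: 'XXX' has no matching airport record
flt = [('JFK', '01010900'), ('JFK', '01011200'), ('XXX', '01010700')]
air = [('JFK', 'NY'), ('SEA', 'WA')]

inner = inner_join(flt, air)
left = left_outer_join(flt, air)
```

Here inner contains only the two JFK flights, while left also keeps the 'XXX' flight paired with None. In PySpark, the analogous calls on the RDDs above would be flt.join(air) and flt.leftOuterJoin(air).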