  • Locked thread
InAndOutBrennan
Dec 11, 2008
We're using Spark at work, and though I usually write Spark jobs in Java (where I'm decently competent), I thought I'd play around a bit with Scala because it seems really nice. I'm very new and have dived right in without reading too much, so I'm happily shooting myself in the foot every day. I do have some grounding in functional programming, though.

Say I have an Iterator[Tuple2[String, String]] and I want to get to a Map[String, TreeSet[String]] where every ._1 (first item in the tuple) is the key and the set for every key contains every ._2 for that key.

Best I came up with is, pseudocodish:
code:
Iterator[Tuple2[String, String]]
	.toList // Iterator has no group by
	.groupBy(t => t._1) // Now we have a Map[String, List[Tuple2[String, String]]]
	.mapValues(tupleList => tupleList.map(t => t._2)) // Now we have a Map[String, List[String]]
	.mapValues(stringList => new TreeSet[String] ++ stringList) // Create a new TreeSet and add everything from the list to it
For code that actually runs, here is an example (it starts from a List, but you can call a.iterator first to get the exact same starting point I have):
code:
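import scala.collection.immutable.TreeSet
val a = List(Tuple2("a", "1"), Tuple2("a", "2"), Tuple2("b", "1"), Tuple2("b", "2"), Tuple2("b", "3"))
// Group by key, keep only the values, then collect each group into a TreeSet
val b = a.groupBy(t => t._1).mapValues(v => v.map(t => t._2)).mapValues(v => TreeSet[String]() ++ v)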
This works, but it is horribly slow compared to bringing in a java.util.HashMap and java.util.TreeSet and doing it much less elegantly. For comparison, the Scala approach hadn't finished a single task out of ~200 in 30 minutes, while the Java-based approach finished in approx. 6 minutes.

So I'm guessing I'm missing something obvious here and I'm creating way too many objects/maps/lists or something.
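
A minimal sketch of what the java.util approach mentioned above could look like when written in Scala. The actual job code isn't shown here, so treat this as an illustration only; `MutableGrouping` and `group` are hypothetical names.

```scala
import java.util.{HashMap => JHashMap, TreeSet => JTreeSet}

object MutableGrouping {
  // Groups the second tuple element by the first, mutating one map in place.
  // Hypothetical helper, not the code from the real job.
  def group(pairs: Iterator[(String, String)]): JHashMap[String, JTreeSet[String]] = {
    val result = new JHashMap[String, JTreeSet[String]]()
    while (pairs.hasNext) {
      val (key, value) = pairs.next()
      var set = result.get(key)
      if (set == null) {
        // First time we see this key: create its sorted set
        set = new JTreeSet[String]()
        result.put(key, set)
      }
      set.add(value)
    }
    result
  }
}
```

It makes a single pass over the iterator and allocates one TreeSet per key, with no intermediate lists or maps.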


InAndOutBrennan
Dec 11, 2008

Thanks!

I'm having some trouble with the "build yo map up by matching in here" part, but you've pointed me in a couple of interesting directions. breakout seems to be very interesting if I can get my head around it.

Edit
What I ended up with:
code:
a.iterator.foldLeft(Map.empty[String, TreeSet[String]].withDefaultValue(TreeSet.empty[String]))((m, s) => m + (s._1 -> (m(s._1) + s._2)))
Initial tests show this runs reasonably fast (10ish minutes), though I haven't been able to compare the two head to head yet. I need an empty cluster for that. I wonder what makes the huge difference, but thanks again.
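
One possible contributor to the difference (an assumption, not something measured on this job): in Scala 2.x, `mapValues` returns a lazy view rather than a new map, so the function passed to it is re-run on every lookup. With two chained `mapValues` calls, each read of a key would rebuild both the list and the TreeSet. A small sketch, with hypothetical names, that counts how often the function runs:

```scala
object MapValuesViewDemo {
  // Returns how many times the mapValues function ran for three lookups of one key.
  def countEvaluations(): Int = {
    var calls = 0
    // mapValues does not evaluate anything here; it only wraps the original map
    val m = Map("a" -> 1).mapValues { v => calls += 1; v * 2 }
    m("a")
    m("a")
    m("a")
    calls // one evaluation per lookup
  }
}
```

Building the map strictly, as the foldLeft above does, avoids that repeated work.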

InAndOutBrennan fucked around with this message at 14:06 on Jun 16, 2016
