
Add async APIs #28

Merged
henridf merged 5 commits into master from async-apis on Jan 16, 2016

Conversation

@henridf (Owner) commented Jan 16, 2016

#26

This is a breaking change that transforms all methods that "run computation" from synchronous to asynchronous (node-style callback) operation. The existing synchronous methods are renamed with a "Sync" suffix.

The methods converted are:

DataFrame.collect(..)
DataFrame.columns(..)
DataFrame.count(..)
DataFrame.head(..)

DataFrameReader.json(..)
DataFrameReader.text(..)

DataFrameWriter.json(..)
DataFrameWriter.text(..)
DataFrameWriter.saveAsTable(..)
DataFrameWriter.insertInto(..)
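
As a rough sketch of what this means for callers, assuming the sqlContext provided by the spark-node shell (the file path and the exact callback result shapes are illustrative, not taken from this PR):

// Previous behavior, now available under the Sync names: blocks until the job finishes.
var df = sqlContext.read().jsonSync("./data/people.json");
console.log(df.countSync() + " rows");

// New behavior: node-style callbacks, so the event loop stays free while Spark runs the job.
sqlContext.read().json("./data/people.json", function (err, df) {
  if (err) return console.error(err);
  df.count(function (err, n) {
    if (err) return console.error(err);
    console.log(n + " rows");
  });
});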

@henridf (Owner, Author) commented Jan 16, 2016

@tobilg after I merge this, you can use the async APIs to run concurrent jobs. No need for explicit Java threads! If you can try this out, I'd love to hear how it works for you.

The naming scheme is unorthodox (sync versions have bare names, async versions are suffixed "Async") to avoid having to touch tons of lines. This should be fixed, but isn't urgent as none of the raw jvm objects are exposed anywhere.

This change turns the "action" methods (those that run a computation) into async methods taking node-style callbacks. For each of the asyncified methods, an additional synchronous version is added, which has the same name with a 'Sync' suffix.

The methods converted here are:
collect(..)
columns(..)
count(..)
head(..)

This change turns the "action" methods (those that run a computation) into async methods taking node-style callbacks. For each of the asyncified methods, an additional synchronous version is added, which has the same name with a 'Sync' suffix.

The methods converted here are:
json(..)
text(..)
load(..) // undocumented

This change turns the "action" methods (those that run a computation) into async methods taking node-style callbacks. For each of the asyncified methods, an additional synchronous version is added, which has the same name with a 'Sync' suffix.

The methods converted here are:
json(..)
text(..)
saveAsTable
insertInto
save(..) // undocumented
henridf added a commit that referenced this pull request Jan 16, 2016
@henridf henridf merged commit eebfa06 into master Jan 16, 2016
@henridf henridf deleted the async-apis branch January 16, 2016 18:14
@tobilg (Contributor) commented Jan 16, 2016

Thanks! Have you successfully tested that actions get executed in parallel? From my understanding this wasn't possible because of the way Spark works...

Or do the async calls just wait until the predecessor is finished?

@henridf (Owner, Author) commented Jan 16, 2016

I did some basic testing with load() by instrumenting the Spark sources to confirm that multiple async loads were indeed concurrent.

Here are some pointers to why this works: when node-java wraps JVM methods, the asynchronous ones end up in https://github.com/joeferner/node-java/blob/master/src/methodCallBaton.cpp#L64 where they run on a different worker thread. This post provides some good context on Node worker threads: https://www.future-processing.pl/blog/on-problems-with-threads-in-node-js/

As to the statement in the Spark docs that "multiple parallel jobs can run simultaneously if they were submitted from separate threads": my understanding is that they mention threads because that's the way to initiate the underlying calls concurrently in Scala/Java (in the basic Spark API the calls are synchronous, as they were in these JavaScript wrappers before this PR). In our case, we get concurrency via the libuv thread pool.
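
A minimal illustration of that, again assuming the shell's sqlContext and two hypothetical input files: both calls return right away, and the blocking JVM work runs on libuv pool threads, so neither read has to wait for the other.

// Both reads are submitted immediately and proceed concurrently.
sqlContext.read().json("./data/a.json", function (err, df) {
  if (!err) console.log("a.json loaded");
});
sqlContext.read().json("./data/b.json", function (err, df) {
  if (!err) console.log("b.json loaded");
});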

@tobilg (Contributor) commented Jan 16, 2016

That sounds great! I'll try to test this on Monday when I'm back in the office.

One idea: to get rid of the callbacks, what do you think about using ES7 async/await and creating promise wrappers for the async callback action methods? I use this in my project via ad-hoc babel.js transpilation...
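
A minimal sketch of such a wrapper over the callback API from this PR (collect() is from the PR; the wrapper name and the async/await usage are illustrative and would have needed Babel at the time):

// Hand-rolled promise wrapper around a node-style callback method.
function collectP(df) {
  return new Promise(function (resolve, reject) {
    df.collect(function (err, rows) {
      if (err) { reject(err); } else { resolve(rows); }
    });
  });
}

// ES7-style usage, transpiled with Babel:
async function rowCount(df) {
  var rows = await collectP(df);
  return rows.length;
}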

@tobilg (Contributor) commented Jan 16, 2016

One question: Did you use the same context in your tests?

@henridf (Owner, Author) commented Jan 16, 2016

Yes, I used the same context in my tests. Is that what you were planning to do?

@henridf (Owner, Author) commented Jan 16, 2016

Re callbacks, I just did it that way as a simple starting point, from which it's easy for users to get promisified versions.

Definitely not a personal preference for callbacks here :)

I also hesitated to add promisified versions as part of this PR... that's easy to do but I'm just not yet sure if it's best to build those in. It sounds like you think they should be?

@tobilg (Contributor) commented Jan 16, 2016

Great, and yes, that's what I'd want to do. I'll test this on Monday and try to incorporate it into my project.

@tobilg (Contributor) commented Jan 16, 2016

Sorry, my last comment was regarding the context. Concerning the promisified functions, I think they would be a nice add-on, and they would make it really easy to write await-style, quasi-synchronous code.

@henridf (Owner, Author) commented Jan 16, 2016

Filed #29 for promisified functions

@tobilg (Contributor) commented Jan 17, 2016

Could you maybe share your test code? I was trying to promisify collect() and execute two different operations (within the same context) with async.parallel(). No real luck so far, unfortunately...
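
For what it's worth, a minimal async.parallel sketch against the plain callback API might look like this (big and small stand for two already-loaded DataFrames; the async package usage is standard Node, not specific to spark-node):

var async = require("async");

// async.parallel starts both tasks immediately, so the two collect() jobs
// are submitted to Spark concurrently; results arrive in task order.
async.parallel([
  function (cb) { big.collect(cb); },
  function (cb) { small.collect(cb); }
], function (err, results) {
  if (err) return console.error(err);
  console.log(results[0].length + " and " + results[1].length + " rows");
});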

@henridf (Owner, Author) commented Jan 17, 2016

When I originally checked (for DataFrameReader.load()), I added "start" and "stop" printlns to the beginning and end of load() in DataFrameReader.scala, started two loads at the same time, and observed that the sequence of logs was "start, start, stop, stop" (as opposed to "start, stop, start, stop" in the case of sequential execution).

Now I just validated that collect() is concurrent. This time I didn't touch the Spark sources. Instead I did this:

  • create two json files, one big (1m rows) and one small (1 row)
  • load both as dataframes (using the same sqlContext)
  • collect both in succession, starting with the "big" one, and observe that the small one finishes first.
spark-node> var big= sqlContext.read().jsonSync("./data/big.json")
spark-node> var small = sqlContext.read().jsonSync("./data/small.json")
spark-node> big.collect((err, res) => console.log("big has " + res.length + " rows")); small.collect((err, res) => console.log("small has " + res.length + " rows")); 

The output was:

spark-node> small has 1 rows
big has 1000000 rows

This shows that the collect that started second finished first.

@tobilg (Contributor) commented Jan 18, 2016

Thanks! Don't you also need to enable the fair scheduler for this, e.g. "spark.scheduler.mode": "FAIR", as outlined at http://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application, or does it work out of the box?
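
For reference, FAIR scheduling is opt-in; the property quoted above is a standard Spark setting that can be supplied outside the application code, e.g. (file and flag shown for illustration):

# spark-defaults.conf
spark.scheduler.mode   FAIR

# or per application, on spark-submit
--conf spark.scheduler.mode=FAIR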

@tobilg (Contributor) commented Jan 18, 2016

Update: I got it to work with my project as well. The problem was that I was using some event listeners which got overwritten when issuing parallel requests... D'oh!

@henridf (Owner, Author) commented Jan 18, 2016

Excellent!

@tobilg (Contributor) commented Jan 18, 2016

Thanks! Regarding the fair scheduler, have you used this as well? With my Spark 1.6.0, the default scheduling mode is FIFO, which I guess is not recommended:

By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.

Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.

@henridf (Owner, Author) commented Jan 18, 2016

I haven't touched the scheduler settings in any way so far. As to FIFO vs fair, presumably fair is better, but it all depends on the needs of your application and users...

tobilg added a commit to tobilg/apache-spark-node that referenced this pull request Jan 20, 2016
@tobilg tobilg mentioned this pull request Jan 20, 2016