Ray master #3

eugenevinitsky · 2018-12-09T20:15:29Z

Upgrading Ray to be close to master.

…ment (ray-project#2995) * simplify vec batch requirements * Update rllib-training.rst * Update rllib-training.rst * Update rllib-training.rst * Update rllib-training.rst * Update rllib-training.rst * Update rllib-models.rst

This tests the case in which a worker is blocked in a call to ray.get or ray.wait, and then the worker dies. Then later, the object that the worker was waiting for becomes available. We need to make sure not to try to send a message to the dead worker and then die. Related to ray-project#2790.

This commit fixes some small defects:

1. Remove a comment that should have been removed in ray-project#3003.
2. Remove `redis_protected_mode`, which is never used in `ray.init()`.
3. Fix `object_id_seed`, which was not being passed into `ray._init()`.
4. Remove several redundant brackets.

## What do these changes do?

1. Add a configuration item `driver.resource-path`.
2. Load driver resources from the local path specified in `ray.conf`.

Before this change, all driver resources (such as the user's jar package, dependency packages, and config files) had to be added to the `classpath`. After this change, driver resources go into the mount path configured in `ray.conf`, and the `classpath` no longer needs to be configured for them.

## Related issue number

N/A

* bugfix: env exists check error
* support to avoid re-build pyarrow in project
* bugfix: adapt gtest for centos lib64
* bugfix: check gtest lib exists in the directory
* bugfix: find gtest with checking all libs exists
* prefix RAY_ to thirdparty env variables to avoid conflicts with other module
* arrow use glog from ray
* change the glog and gtest install dir

## What do these changes do?

Fix how driver resources are loaded from a specified path. This also addresses the comments from the related PR [3044](ray-project#3044).

## Related PRs

[ray-project#3044](ray-project#3044) and [ray-project#3001](ray-project#3001).

…oject#3068)

## What do these changes do?

Fix the misleading code comments for:

- `EPISODES_THIS_ITER`
- `EPISODES_TOTAL`

I had noted this before and planned to fix it along with some other changes, but it seemed relevant enough to land next to ray-project#3058, so I am sending it now.

ray.wait depends on callbacks from the GCS to decide when an object has appeared in the cluster. The raylet crashes if a callback is received for a wait request that has already completed, but this actually can happen, depending on the order of calls. More precisely:

1. Objects A and B are put in the cluster.
2. Client calls ray.wait([A, B], num_returns=1).
3. Client subscribes to locations for A and B. Locations are cached for both, so callbacks are posted for each.
4. Callback for A fires. The wait completes and the request is removed.
5. Callback for B fires. The wait request no longer exists and raylet crashes.
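The ordering described above can be reproduced with a small toy model. This is only a sketch in plain Python, not the raylet's actual code: `WaitManager`, `on_location_callback`, and the `"w1"` request id are hypothetical names chosen for illustration. It shows one way to tolerate the late callback, namely treating a callback for a missing wait request as a no-op instead of a fatal error.

```python
class WaitManager:
    """Toy model of wait-request bookkeeping (hypothetical, not raylet code)."""

    def __init__(self):
        self._requests = {}   # wait_id -> state of a wait that is still active
        self.completed = {}   # wait_id -> object ids returned to the client

    def wait(self, wait_id, object_ids, num_returns):
        """Register a wait request for the given objects."""
        self._requests[wait_id] = {
            "remaining": set(object_ids),
            "found": [],
            "num_returns": num_returns,
        }

    def on_location_callback(self, wait_id, object_id):
        """Handle a location callback; return False if the wait is already gone."""
        request = self._requests.get(wait_id)
        if request is None:
            # The wait already completed and was removed. Ignore the late
            # callback instead of treating it as an invariant violation.
            return False
        if object_id in request["remaining"]:
            request["remaining"].discard(object_id)
            request["found"].append(object_id)
        if len(request["found"]) >= request["num_returns"]:
            self.completed[wait_id] = request["found"]
            del self._requests[wait_id]
        return True


manager = WaitManager()
manager.wait("w1", ["A", "B"], num_returns=1)
first = manager.on_location_callback("w1", "A")   # completes and removes the wait
second = manager.on_location_callback("w1", "B")  # late callback, safely ignored
```

Here the second callback finds no matching request and returns `False`, which corresponds to step 5 without the crash.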

* Add the extra fallback for serialization. * Better comments & warnings. quotes. * Update test/runtest.py Co-Authored-By: suquark <[email protected]> * Update test/runtest.py Co-Authored-By: suquark <[email protected]> * linting * Don't hijack too much errors. * simplify the test * Update runtest.py * simplify

`test_free_objects_multi_node` can fail intermittently: in 20 runs, we may find at least one failure. The cause is that the test is based on function tasks. One raylet may create more than one worker to execute the tasks, so the flush operations may be spread across several workers and fail to clean up all the worker objects held by the plasma client. This PR changes the function tasks to actor tasks, which guarantees that all the tasks are executed in one worker of a raylet.

## What do these changes do?

The JSON logger now uses cloudpickle to dump the configs as well, which pickles the functions needed for multi-agent replay.
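A short sketch of why a JSON-only dump is not enough for configs that carry functions. It uses only the standard library and does not call cloudpickle; the config keys and `policy_mapping_fn` below are illustrative assumptions, not copied from the logger's code.

```python
import json


def policy_mapping_fn(agent_id):
    """Example callable of the kind a multi-agent config may carry."""
    return "policy_{}".format(agent_id % 2)


# Hypothetical config: one plain value and one function value.
config = {"lr": 0.001, "multiagent": {"policy_mapping_fn": policy_mapping_fn}}

try:
    json.dumps(config)
    json_ok = True
except TypeError:
    # json has no encoding for function objects.
    json_ok = False

# A lossy fallback: stringify anything json cannot encode. The output stays
# readable, but the function cannot be recovered from its string form.
lossy = json.loads(json.dumps(config, default=str))
```

The stringified entry is enough for a human-readable log but not for replaying a run, which is the gap a pickle-based dump next to the JSON file is meant to close.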

* Init commit for async plasma client * Create an eventloop model for ray/plasma * Implement a poll-like selector base on `ray.wait`. Huge improvements. * Allow choosing workers & selectors * remove original design * initial implementation of epoll-like selector for plasma * Add a param for `worker` used in `PlasmaSelectorEventLoop` * Allow accepting a `Future` which returns object_id * Do not need `io.py` anymore * Create a basic testing model * fix: `ray.wait` returns tuple of lists * fix a few bugs * improving performance & bug fixing * add test * several improvements & fixing * fix relative import * [async] change code format, remove old files * [async] Create context wrapper for the eventloop * [async] fix: context should return a value * [async] Implement futures grouping * [async] Fix bugs & replace old functions * [async] Fix bugs found in tests * [async] Implement `PlasmaEpoll` * [async] Make test faster, add tests for epoll * [async] Fix code format * [async] Add comments for main code. * [async] Fix import path. * [async] Fix test. * [async] Compatibility. * [async] less verbose to not annoy the CI. * [async] Add test for new API * [async] Allow showing debug info in some of the test. * [async] Fix test. * [async] Proper shutdown. * [async] Lint~ * [async] Move files to experimental and create API * [async] Use async/await syntax * [async] Fix names & styles * [async] comments * [async] bug fixing & use pytest * [async] bug fixing & change tests * [async] use logger * [async] add tests * [async] lint * [async] type checking * [async] add more tests * [async] fix bugs on waiting a future while timeout. Add more docs. * [async] Formal docs. * [async] Add typing info since these codes are compatible with py3.5+. * [async] Documents. * [async] Lint. * [async] Fix deprecated call. * [async] Fix deprecated call. * [async] Implement a more reasonable way for dealing with pending inputs. 
* [async] Fix docs * [async] Lint * [async] Fix bug: Type for time * [async] Set our eventloop as the default eventloop so that we can get it through `asyncio.get_event_loop()`. * [async] Update test & docs. * [async] Lint. * [async] Temporarily print more debug info. * [async] Use `Poll` as a default option. * [async] Limit resources. * new async implementation for Ray * implement linked list * bug fix * update * support seamless async operations * update * update API * fix tests * lint * bug fix * refactor names * improve doc * properly shutdown async_api * doc * Change the table on the index page. * Adjust table size. * Only keeps `as_future`. * change how we init connection * init connection in `ray.worker.connect` * doc * fix * Move initialization code into the module. * Fix docs & code * Update pyarrow version. * lint * Restore index.rst * Add known issues. * Apply suggestions from code review Co-Authored-By: suquark <[email protected]> * rename * Update async_api.rst * Update async_api.py * Update async_api.rst * Update async_api.py * Update worker.py * Update async_api.rst * fix tests * lint * lint * replace the magic number
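The commit log above mentions a poll-like selector built on `ray.wait` and an `as_future` entry point. The following is a minimal sketch of that general idea using plain `asyncio`, under stated assumptions: `FakeStore` and `as_future_sketch` are hypothetical stand-ins, not the Ray or plasma API, and the real implementation may work quite differently.

```python
import asyncio


class FakeStore:
    """Stand-in for an object store; NOT a Ray or plasma API."""

    def __init__(self):
        self._objects = {}

    def put(self, object_id, value):
        self._objects[object_id] = value

    def wait(self, object_ids):
        """Return (ready, pending) as a tuple of lists."""
        ready = [oid for oid in object_ids if oid in self._objects]
        pending = [oid for oid in object_ids if oid not in self._objects]
        return ready, pending

    def get(self, object_id):
        return self._objects[object_id]


async def as_future_sketch(store, object_id, poll_interval=0.01):
    """Poll the store until the object appears, then return its value."""
    while True:
        ready, _ = store.wait([object_id])
        if ready:
            return store.get(object_id)
        # Yield to the event loop so other coroutines can run meanwhile.
        await asyncio.sleep(poll_interval)


async def main():
    store = FakeStore()
    task = asyncio.ensure_future(as_future_sketch(store, "obj-1"))
    await asyncio.sleep(0.03)  # the object does not exist yet
    store.put("obj-1", 42)     # now it becomes available
    return await task


result = asyncio.run(main())
```

Polling keeps the sketch short; an epoll-like design, as the log also mentions, would instead be notified when objects become ready.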

ericl and others added 30 commits September 30, 2018 18:36

[rllib] Default to truncate_episodes and add some more config validat…

e4bea8d

…ors (ray-project#2967) * update * link it * warn about truncation * fix * Update rllib-training.rst * deprecate tests failing

[rllib] Propagate model options correctly in ARS / ES, to action dist…

b45bed4

… of PPO (ray-project#2974) * fix * fix * fix it * propagate conf to action dist * move carla example too * rr * Update policies.py * wip * lint

[Java] Fix the required-resources issue of actor member function in J…

fcef4ed

…ava worker. (ray-project#3002) This fixes a bug in which Java actor methods inherit the resource requirements of the actor creation task.

[rllib] Remove legacy multiagent support (ray-project#2975)

2019b41

* remove legacy * remove reshaper

fix bug: (ray-project#3000)

9c606ea

Before the fix, RAY_FUN_CACHE was only ever read with get and never populated, so lookups always returned null. Fix: put the entry into the cache after creating it.

Change logfile names and also allow plasma store socket to be passed …

cc7e2ec

…in. (ray-project#2862)

Update links to use latest 0.5.3 wheels instead of 0.5.2. (ray-projec…

d73ee36

…t#3018)

Move function/actor exporting & loading code to function_manager.py (r…

9948e8c

…ay-project#3003) Move function/actor exporting & loading code to function_manager.py to prepare the code change for function descriptor for python.

Suppress errors when worker or driver intentionally disconnects. (ray…

01bb073

…-project#2935)

Introduce concept of resources required for placing a task. (ray-proj…

faa31ae

…ect#2837) * Introduce concept of resources required for placement. * Add placement resources to task spec * Update java worker * Update taskinfo.java

[tune/core] Use Global State API for resources (ray-project#3004)

0651d3b

[core] Improve logging message when plasma store is started. (ray-pro…

ecd8f39

…ject#3029) Improve logging message when plasma store is started.

Bug/log syncer fails with parentheses (ray-project#2653)

2d35a97

* Update rsync command * Escape rsync locations * Fix the accidental variable move * Update rsync to use -s flag
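A minimal sketch of the escaping idea, assuming the sync command is assembled as a shell string. `build_rsync_command` and the example paths are hypothetical, and the exact flags used by the log syncer are an assumption; `-s` is rsync's `--protect-args` option, which keeps the remote shell from re-interpreting the arguments.

```python
import shlex


def build_rsync_command(source, target):
    """Build an rsync command string with both locations shell-quoted."""
    return "rsync -s -avz {} {}".format(shlex.quote(source), shlex.quote(target))


# Paths containing spaces and parentheses would otherwise break the shell command.
command = build_rsync_command(
    "/tmp/results/exp (lr=0.01)/",
    "user@host:/backup/exp (lr=0.01)/",
)
```

Quoting protects the locations from the local shell, while `-s` covers the remote side.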

Fix the uniqueId toString format. (ray-project#3035)

ef1f2fd

[Java] Improve some Java code (ray-project#3040)

4a2ed47

This PR improves some Java code and removes some duplicated code.

[tune] Tweaks to Trainable and Verbosity (ray-project#2889)

f9b58d7

move make clean before cmake command, avoid always running mvn instal…

87639b9

…l plasma java lib (ray-project#3047)

[rllib] Add unit test and some better error messages for custom polic…

473ee4e

…y states (ray-project#3032)

[rllib] Don't crash printing out error message (ray-project#3054)

866c7a5

* fix er * update

[tune] Fix misleading comment (ray-project#3058)

4dc78b7

[rllib] Parallel-data loading and multi-gpu support for IMPALA (ray-p…

3c891c6

…roject#2766)

[rllib] Add more warnings when multi-agent envs might not be set up r…

6240ccb

…ight (ray-project#3061)

[Java] Add jvm-parameters in Config. (ray-project#3065)

64e5eb3

GiliR4t1qbit and others added 30 commits November 30, 2018 16:39

[docs] Snippet did not have a code-block tag above it (ray-project#3442)

454d3aa

Update readme to contain logo (ray-project#3443)

5751261

* Adding logo to readme * Updating link * Add badge * Addressing comments * Moving logo * Change align * Move image

Bump version from 0.5.3 to 0.6.0. (ray-project#3420)

0603e0b

Add stress test for Java worker (ray-project#3424)

abd37df

Upgrade Arrow to include Plasma TensorFlow Op release fix (ray-projec…

c5b5cda

…t#3448) This includes a fix so the TensorFlow op releases memory properly (apache/arrow#3061) and the possibility to store arrow data structures in plasma (apache/arrow#2832). ray-project#3404

Update README.rst with 0.6.0 version number. (ray-project#3453)

13c8ce4

[rllib] Better error message for unsupported non-atari image observat…

7abfbfd

…ion sizes (ray-project#3444)

[rllib] Auto clip actions to Box space range; deprecate squash_to_ran…

d820597

…ge (ray-project#3426) * fix clip * tweak wording * remove squash entirely * Update rllib-models.rst * fix argument order * Apply suggestions from code review Co-Authored-By: ericl <[email protected]>
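As an illustration of what clipping an action to a Box range means, here is a sketch in plain Python. `clip_action` and the bounds are hypothetical; the actual change presumably operates on the bounds of the environment's action space rather than on Python lists.

```python
def clip_action(action, low, high):
    """Clamp each action component into its [low, high] interval."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]


# A policy may emit values outside the declared range of the action space.
raw_action = [1.7, -3.0, 0.2]
clipped = clip_action(raw_action, low=[-1.0, -1.0, -1.0], high=[1.0, 1.0, 1.0])
```

Components already inside the range pass through unchanged; only out-of-range components are moved to the nearest bound.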

Tweak/exec attach info (ray-project#3447)

be6567e

* Add custom cluster name to exec info * Update submit info to match exec info

[rllib] Allow envs to be auto-registered; add on_train_result callbac…

ce355d1

…k with curriculum example (ray-project#3451) * train step and docs * debug * doc * doc * fix examples * fix code * integration test * fix * ... * space * instance * Update .travis.yml * fix test

[tune] Component notification on node failure + Tests (ray-project#3414)

9d0bd50

Changes include: - Notify Components on Requeue - Slight refactoring of Node Failure handling - Better tests

[docs] Switch docs to use rllib train instead of train.py

93a9d32

Make test_actor_multiple_gpus_from_multiple_tasks less stressful in t…

06f6431

…ravis

increase container memory and shm to 20G (ray-project#3475)

7a79b7f

* increase container memory and shm to 20G * variables are POWERFUL

[rllib] fixes from dogfooding multi-agent (ray-project#3456)

d864f29

* auto wrap multi-agent dict and tuple spaces by keeping a policy -> preprocessor in the sampler
* add some Q-learning debug stats
* report min, max of custom metrics
* better errors

[tune] Deprecate ambiguous function values (use tune.function / tune.…

412aaa5

…sample_from instead) (ray-project#3457) * wip * exclude

Removing the check about the size re: ray-project#3450 (ray-project#3464

970babf

) * Removing the check about the size re: ray-project#3450 * Addressing comments * Update services.py

[rllib] Copy data before passing to Ape-X learner thread (fixes trans…

8395523

…ient plasma crashes) (ray-project#3484)

Resolve no handlers could be found for logger 'ray.worker' when impor…

f6490f9

…ting ray (ray-project#3483)

[rllib] Use smoothed version of collect metrics for DQN (ray-project#…

462e6ef

…3491) * fix * lint

[rllib] Better document which methods are abstract and which ones are…

8b5827b

… overrides (ray-project#3480)

[rllib] Multi-GPU support for Multi-Agent PPO (ray-project#3479)

7aec357

* wip * fix * remove check * fix null * revert * lint and kl * also fix rollout

Add return value for reconstruction RPC. (ray-project#3493)

0136af5

* Add return value for reconstruct RPC. * Fix comment function name

Add option to evict keys LRU from the sharded redis tables (ray-proje…

cffe8f9

…ct#3499) * wip * wip * format * wip * note * lint * fix * flag * typo * raise timeout * fix * optional get * fix flag * increase timeout in test * update docs * format

changed cmake to make it build correctly

ce606a9
