Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unity device mapping algorithm #1459

Merged
merged 35 commits into from
Oct 8, 2024
Merged

Conversation

wmdi
Copy link
Collaborator

@wmdi wmdi commented Aug 7, 2024

Description of changes:
This PR recovers the device mapping algorithm (the DP-based algorithm) used in Unity.

Related Issues:

Linked Issues:

  • Issue #

Issues closed by this PR:

  • Closes #

This change is Reviewable

@wmdi wmdi requested a review from lockshaw August 7, 2024 20:06
Copy link

codecov bot commented Aug 7, 2024

Codecov Report

Attention: Patch coverage is 87.15596% with 308 lines in your changes missing coverage. Please review.

Project coverage is 75.95%. Comparing base (cf96db6) to head (a2b8832).
Report is 1 commits behind head on repo-refactor.

Files with missing lines Patch % Lines
...graph/computation_graph_binary_sp_decomposition.cc 0.00% 94 Missing ⚠️
...ler/machine_mapping/machine_mapping_constraints.cc 55.76% 23 Missing ⚠️
lib/compiler/src/compiler/graph_optimize_state.cc 53.19% 22 Missing ⚠️
...compiler/machine_mapping/transitive_reduced_pcg.cc 69.56% 14 Missing ⚠️
lib/pcg/src/pcg/computation_graph.cc 0.00% 13 Missing ⚠️
...clude/utils/full_binary_tree/get_subtree_at_path.h 0.00% 11 Missing ⚠️
...b/utils/src/utils/any_value_type/any_value_type.cc 0.00% 11 Missing ⚠️
...ler/machine_mapping/get_optimal_machine_mapping.cc 92.95% 10 Missing ⚠️
...b/utils/include/utils/full_binary_tree/get_child.h 0.00% 8 Missing ⚠️
...el_computation_graph/parallel_computation_graph.cc 78.78% 7 Missing ⚠️
... and 33 more
Additional details and impacted files
@@                Coverage Diff                @@
##           repo-refactor    #1459      +/-   ##
=================================================
+ Coverage          73.22%   75.95%   +2.73%     
=================================================
  Files                731      777      +46     
  Lines              23398    25487    +2089     
  Branches             690      720      +30     
=================================================
+ Hits               17133    19359    +2226     
+ Misses              6265     6128     -137     
Flag Coverage Δ
unittests 75.95% <87.15%> (+2.73%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...r/include/compiler/cost_estimator/cost_estimator.h 100.00% <100.00%> (ø)
...b/compiler/include/compiler/graph_optimize_state.h 100.00% <100.00%> (ø)
...iler/src/compiler/cost_estimator/cost_estimator.cc 100.00% <100.00%> (ø)
...get_abstracted_tensor_set_movement_across_split.cc 100.00% <100.00%> (ø)
...er/src/compiler/machine_mapping/machine_mapping.cc 100.00% <100.00%> (ø)
...g_problem_tree/get_machine_mapping_problem_tree.cc 100.00% <100.00%> (ø)
...ping_problem_tree/unmapped_op_cost_estimate_key.cc 100.00% <100.00%> (ø)
...computation_graph_series_parallel_decomposition.cc 100.00% <ø> (ø)
...ler/series_parallel/pcg/pcg_binary_series_split.cc 100.00% <100.00%> (ø)
...ompiler/machine_mapping/cost_estimator_for_test.cc 100.00% <100.00%> (ø)
... and 99 more

... and 33 files with indirect coverage changes

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add compiler-tests to CI and proj?

Also, overall this code could use a bit of simplifying and readability-improving. For a lot of the cases where you're unwrapping types consider just defining equivalent functions for those types, which would help simplify things. Overall, more small functions with better names would probably help over the current large functions which can be a bit hard to actually read through

Reviewed 7 of 15 files at r1, 8 of 8 files at r2, all commit messages.
Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @wmdi)


lib/compiler/include/compiler/graph_optimize_result.struct.toml line 7 at r2 (raw file):

includes = [ 
  "compiler/machine_mapping.h"
]

Suggestion:

includes = [
  "compiler/machine_mapping.dtg.h",
  "pcg/parallel_computation_graph/parallel_computation_graph.dtg.h",
]

lib/compiler/include/compiler/graph_optimize_state.h line 13 at r2 (raw file):

  float runtime;

  bool operator==(GraphOptimizeState const &other) const;

GraphOptimizeState::operator== should probably have tests


lib/compiler/include/compiler/unity_algorithm.h line 13 at r2 (raw file):

namespace FlexFlow {

struct OptimizerConfig {

Move over to dtgen so it's printable, equality-checkable, hashable, etc.?


lib/compiler/src/graph_optimize_state.cc line 6 at r2 (raw file):

bool GraphOptimizeState::operator==(GraphOptimizeState const &other) const {
  auto layers1 = topological_ordering(graph_optimize_result.pcg);

I'm not sure topological_ordering is the best thing to rely on here. This seems like a graph-level comparison function, so it might be better to figure out what the semantics are and then it can get added to graph so you only have to use it here?


lib/compiler/src/graph_optimize_state.cc line 30 at r2 (raw file):

  return !(*this == other);
}

Where is operator<? I think it was declared?


lib/compiler/src/machine_mapping.cc line 159 at r2 (raw file):

        cached_subgraph_costs(cached_subgraph_costs) {}

  ParallelComputationGraph pcg;

Why move the PCG from OptimalCostFunctor to MachineMappingSearcher?


lib/compiler/src/machine_mapping.cc line 166 at r2 (raw file):

  OptimalCostCache &cached_subgraph_costs;

  struct OptimalCostFunctor {

Why do we have this separate OptimalCostFunctor?


lib/compiler/src/machine_mapping.cc line 219 at r2 (raw file):

    std::unordered_set<DataflowOutput> split_outputs;
    for (auto const &[value, _] :

Might be simpler with transform? Also the assert isn't really necessary, get will already throw if you perform an invalid access


lib/compiler/src/machine_mapping.cc line 287 at r2 (raw file):

          optimal_result,
          OptimalCostResult::parallel_combine(
              std::visit(OptimalCostFunctor(

Why not just pull out a separate function that does std::visit(OptimalCostFunctor(...)) rather than repeat it each time?


lib/compiler/src/machine_mapping.cc line 320 at r2 (raw file):

           enumerate_machine_views({node}, resource)) {
        MachineMapping mv_map{{{node, node_machine_views.at(node)}}};
        std::unordered_map<OpenDataflowValue, MachineView> machine_views =

Might be cleaner using merge_maps and generate_map?


lib/compiler/src/machine_mapping.cc line 334 at r2 (raw file):

  std::vector<std::unordered_map<Node, MachineView>>
      enumerate_machine_views(std::unordered_set<Node> const &nodes,

Not entirely clear what these do? Can you rename them to something a bit more descriptive?


lib/compiler/test/src/test_optimal_cost.cc line 9 at r2 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("optimal_cost_0") {

Suggestion:

  TEST_CASE("optimal_cost does not crash on minimal inputs") {

lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r2 (raw file):

SubParallelComputationGraph
    sub_pcg_from_partial_pcg(ParallelComputationGraph const &,

Is this just get_subgraph for PCGs?


lib/substitutions/src/substitutions/sub_parallel_computation_graph.cc line 61 at r2 (raw file):

                             std::unordered_set<Node> const &nodes) {
  auto as_open = view_as_labelled_open_dataflow_graph(pcg.raw_graph);
  OpenDataflowSubgraphResult subgraph_result = get_subgraph(as_open, nodes);

Probably better to just add get_subgraph for LabelledOpenDataflowGraph and then use it here

Copy link
Collaborator Author

@wmdi wmdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 14 unresolved discussions (waiting on @lockshaw)


lib/compiler/include/compiler/graph_optimize_result.struct.toml line 7 at r2 (raw file):

includes = [ 
  "compiler/machine_mapping.h"
]

Done.


lib/compiler/include/compiler/graph_optimize_state.h line 13 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

GraphOptimizeState::operator== should probably have tests

Done.


lib/compiler/include/compiler/unity_algorithm.h line 13 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Move over to dtgen so it's printable, equality-checkable, hashable, etc.?

Done.


lib/compiler/src/graph_optimize_state.cc line 6 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'm not sure topological_ordering is the best thing to rely on here. This seems like a graph-level comparison function, so it might be better to figure out what the semantics are and then it can get added to graph so you only have to use it here?

I think I require homomorphism assuming inputs are ordered.


lib/compiler/src/machine_mapping.cc line 159 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why move the PCG from OptimalCostFunctor to MachineMappingSearcher?

This is to avoid transferring variables that are not changed during DP, i.e., they are 'constants' to the DP, instead of states.


lib/compiler/src/machine_mapping.cc line 166 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why do we have this separate OptimalCostFunctor?

To separate DP states with DP constants (environments).


lib/compiler/src/machine_mapping.cc line 219 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might be simpler with transform? Also the assert isn't really necessary, get will already throw if you perform an invalid access

I do not find methods to get the set of left/right values of a bidict


lib/compiler/src/machine_mapping.cc line 287 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not just pull out a separate function that does std::visit(OptimalCostFunctor(...)) rather than repeat it each time?

Done.


lib/compiler/src/machine_mapping.cc line 320 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might be cleaner using merge_maps and generate_map?

We do not have the utils required for bidict


lib/compiler/src/machine_mapping.cc line 334 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Not entirely clear what these do? Can you rename them to something a bit more descriptive?

Done.


lib/compiler/test/src/test_optimal_cost.cc line 9 at r2 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("optimal_cost_0") {

Done.


lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is this just get_subgraph for PCGs?

Yes

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 9 files at r3, all commit messages.
Reviewable status: 15 of 17 files reviewed, 9 unresolved discussions (waiting on @wmdi)


lib/compiler/include/compiler/optimal_cost_state.struct.toml line 20 at r3 (raw file):

  "utils/hash/unordered_map.h",
  "utils/fmt/unordered_map.h", 
]

Suggestion:

includes = [
  "utils/graph/serial_parallel/serial_parallel_decomposition.dtg.h",
  "pcg/machine_specification.dtg.h",
  "pcg/machine_view.dtg.h",
  "utils/graph/open_dataflow_graph/open_dataflow_edge.dtg.h",
  "utils/graph/open_dataflow_graph/open_dataflow_value.dtg.h",
]

src_includes = [
  "utils/hash/unordered_map.h",
  "utils/fmt/unordered_map.h",
]

lib/compiler/src/graph_optimize_state.cc line 6 at r2 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

I think I require homomorphism assuming inputs are ordered.

Using a hack like this for now is fine, but eventually it would be good to move over to a more principled algorithm implemented in utils/graph that doesn't make correctness tradeoffs and more explicitly uses the properties of DataflowGraph that you're relying on for efficiency. Until that's implemented let's add a comment in the code that clarifies that this is a hack and links to an issue with the plan for moving this over to something more correct


lib/compiler/src/machine_mapping.cc line 159 at r2 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

This is to avoid transferring variables that are not changed during DP, i.e., they are 'constants' to the DP, instead of states.

As discussed in the Wednesday meeting, it should be possible to remove OptimalCostFunctor, especially as we're now on C++17 and can use overload

There seem to be two potential designs (either (1) making optimal_cost a method on MachineMappingSearcher, or (2) making MachineMappingSearcher just a wrapper for the global state (probably renamed to something like DPContext) and passing that around between top-level functions as a mutable reference, i.e., essentially doing the passing of this manually)


lib/compiler/src/machine_mapping.cc line 219 at r2 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

I do not find methods to get the set of left/right values of a bidict

See keys in utils/containers/keys.h


lib/compiler/src/machine_mapping.cc line 320 at r2 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

We do not have the utils required for bidict

Not exactly sure the details of simplifying this, but making it more clear what this is doing would be nice (maybe using functions from utils/containers or pulling out some pieces into separate named lambdas)


lib/compiler/test/src/test_graph_optimize_state.cc line 8 at r3 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("graph_optimize_state:equality") {

Suggestion:

  TEST_CASE("GraphOptimizeState::operator==") {

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 9 files at r3.
Reviewable status: all files reviewed, 13 unresolved discussions (waiting on @wmdi)


lib/compiler/src/graph_optimize_state.cc line 57 at r3 (raw file):

namespace std {

size_t hash<::FlexFlow::GraphOptimizeState>::operator()(

See https://reviewable.io/reviews/flexflow/FlexFlow/1459#-O3kIAym5Yc4ECMISkBW. Eventually it might be good to use a proper graph hash like https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.graph_hashing.weisfeiler_lehman_graph_hash.html#networkx.algorithms.graph_hashing.weisfeiler_lehman_graph_hash


lib/compiler/test/src/test_graph_optimize_state.cc line 44 at r3 (raw file):

                                                  "dense1");

    ParallelComputationGraph pcg = builder.pcg;

Suggestion:

    ParallelComputationGraph pcg1 = [&] {
      ParallelComputationGraphBuilder builder;
      
      
      ParallelTensorShape input_shape =
        ParallelTensorShape{ParallelTensorDims{
                                FFOrdered<ShardParallelDim>{
                                    ShardParallelDim{32, 2},
                                    ShardParallelDim{16, 1},
                                },
                                ReplicaParallelDimSet{
                                    SumDegree{1},
                                    DiscardCopyDegree{1},
                                },
                            },
                            DataType::FLOAT};

      parallel_tensor_guid_t input0 =
        builder.create_input_tensor(input_shape);
      parallel_tensor_guid_t dense0 = builder.dense(input0,
                                                  8);

      parallel_tensor_guid_t dense1 = builder.dense(dense0,
                                                  4);
      return builder.pcg
    }();

lib/compiler/test/src/test_graph_optimize_state.cc line 67 at r3 (raw file):

                                                   "dense0");

    ParallelComputationGraph pcg_ = builder.pcg;

Suggestion:

    ParallelComputationGraph pcg2 = [&] {
      ParallelComputationGraphBuilder builder;  
      
      parallel_tensor_guid_t input0 =
        builder.create_input_tensor(input_shape);
      parallel_tensor_guid_t dense0 = builder.dense(input0,
                                                   8);
                                                   
      return builder.pcg;                          
    }();

lib/compiler/test/src/test_graph_optimize_state.cc line 69 at r3 (raw file):

    ParallelComputationGraph pcg_ = builder.pcg;

    CHECK(GraphOptimizeState(GraphOptimizeResult(pcg, empty_machine_mapping), 0) !=

This doesn't really sufficiently test GraphOptimizeState::operator==--from this I'm not sure (1) what role the machine mappings and the cost play in equality, and (2) whether any graphs compare equal at all, or whether all of them just return not equal. A few more test/sub-cases are needed


lib/compiler/test/src/test_graph_optimize_state.cc line 70 at r3 (raw file):

    CHECK(GraphOptimizeState(GraphOptimizeResult(pcg, empty_machine_mapping), 0) !=
          GraphOptimizeState(GraphOptimizeResult(pcg_, empty_machine_mapping), 0));

Suggestion:

    GraphOptimizeState state1 = GraphOptimizeState{
      GraphOptimizeResult{pcg, empty_machine_mapping},
      0,  
    };
    
    GraphOptimzieState state2 = GraphOptimizeState{
      GraphOptimizeResult{pcg, empty_machine_mapping},
      0,
    };
    
    CHECK(state1 != state2);

lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r2 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

Yes

In that case let's rename it to get_pcg_subgraph

Copy link
Collaborator Author

@wmdi wmdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 41 files reviewed, 13 unresolved discussions (waiting on @lockshaw)


lib/compiler/src/graph_optimize_state.cc line 6 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Using a hack like this for now is fine, but eventually it would be good to move over to a more principled algorithm implemented in utils/graph that doesn't make correctness tradeoffs and more explicitly uses the properties of DataflowGraph that you're relying on for efficiency. Until that's implemented let's add a comment in the code that clarifies that this is a hack and links to an issue with the plan for moving this over to something more correct

Done.


lib/compiler/src/machine_mapping.cc line 159 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

As discussed in the Wednesday meeting, it should be possible to remove OptimalCostFunctor, especially as we're now on C++17 and can use overload

There seem to be two potential designs (either (1) making optimal_cost a method on MachineMappingSearcher, or (2) making MachineMappingSearcher just a wrapper for the global state (probably renamed to something like DPContext) and passing that around between top-level functions as a mutable reference, i.e., essentially doing the passing of this manually)

Done.


lib/compiler/src/machine_mapping.cc line 219 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

See keys in utils/containers/keys.h

Done.


lib/compiler/src/machine_mapping.cc line 320 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Not exactly sure the details of simplifying this, but making it more clear what this is doing would be nice (maybe using functions from utils/containers or pulling out some pieces into separate named lambdas)

Done.


lib/compiler/test/src/test_graph_optimize_state.cc line 8 at r3 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("graph_optimize_state:equality") {

Done.


lib/compiler/test/src/test_graph_optimize_state.cc line 44 at r3 (raw file):

                                                  "dense1");

    ParallelComputationGraph pcg = builder.pcg;

Done.


lib/compiler/test/src/test_graph_optimize_state.cc line 67 at r3 (raw file):

                                                   "dense0");

    ParallelComputationGraph pcg_ = builder.pcg;

Done.


lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

In that case let's rename it to get_pcg_subgraph

Done.


lib/compiler/include/compiler/optimal_cost_state.struct.toml line 20 at r3 (raw file):

  "utils/hash/unordered_map.h",
  "utils/fmt/unordered_map.h", 
]

Done.

Copy link
Collaborator Author

@wmdi wmdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 41 files reviewed, 13 unresolved discussions (waiting on @lockshaw)


lib/compiler/src/graph_optimize_state.cc line 57 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

See https://reviewable.io/reviews/flexflow/FlexFlow/1459#-O3kIAym5Yc4ECMISkBW. Eventually it might be good to use a proper graph hash like https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.graph_hashing.weisfeiler_lehman_graph_hash.html#networkx.algorithms.graph_hashing.weisfeiler_lehman_graph_hash

Done.


lib/compiler/test/src/test_graph_optimize_state.cc line 69 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This doesn't really sufficiently test GraphOptimizeState::operator==--from this I'm not sure (1) what role the machine mappings and the cost play in equality, and (2) whether any graphs compare equal at all, or whether all of them just return not equal. A few more test/sub-cases are needed

This will be fixed in another PR.


lib/compiler/test/src/test_graph_optimize_state.cc line 70 at r3 (raw file):

    CHECK(GraphOptimizeState(GraphOptimizeResult(pcg, empty_machine_mapping), 0) !=
          GraphOptimizeState(GraphOptimizeResult(pcg_, empty_machine_mapping), 0));

This will be fixed in another PR.


lib/substitutions/src/substitutions/sub_parallel_computation_graph.cc line 61 at r2 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Probably better to just add get_subgraph for LabelledOpenDataflowGraph and then use it here

Done.

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 27 of 37 files at r4, 6 of 7 files at r5, 1 of 2 files at r6, all commit messages.
Reviewable status: 38 of 41 files reviewed, 31 unresolved discussions (waiting on @wmdi)


lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 6 at r6 (raw file):

#include "machine_mapping.h"
#include "machine_mapping_cache.h"
#include "machine_mapping_context.dtg.h"

Suggestion:

#include "compiler/machine_mapping/machine_mapping.h"
#include "compiler/machine_mapping/machine_mapping_cache.h"
#include "compiler/machine_mapping/machine_mapping_context.dtg.h"

lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 16 at r6 (raw file):

    ParallelComputationGraph const &pcg,
    std::function<std::unordered_set<MachineView>(
        ParallelLayerAttrs const &, MachineSpecification const &)> const

This isn't going to allow access to the input and output tensor shapes of the operator, which are probably necessary?

Code quote:

        ParallelLayerAttrs const &, M

lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 23 at r6 (raw file):

MachineMappingResult
    get_optimal_machine_mapping_internal(MachineMappingContext &context,

What is this overload for? Why does it not take a SerialParallelDecomposition?


lib/compiler/include/compiler/machine_mapping/machine_mapping.h line 4 at r6 (raw file):

#define _FLEXFLOW_COMPILER_MACHINE_MAPPING_H

#include "machine_mapping.dtg.h"

Suggestion:

#include "compiler/machine_mapping/machine_mapping.dtg.h"

lib/compiler/include/compiler/machine_mapping/machine_mapping.h line 8 at r6 (raw file):

namespace FlexFlow {

MachineMapping combine(MachineMapping const &, MachineMapping const &);

Suggestion:

MachineMapping combine_disjoint_mappings(MachineMapping const &, MachineMapping const &);

lib/compiler/include/compiler/machine_mapping/machine_mapping_cache.h line 10 at r6 (raw file):

namespace FlexFlow {

class MachineMappingCache {

Does it make sense to have a separate class/object for this, or this not just the std::unordered_map API?


lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 7 at r5 (raw file):

includes = [
  "machine_mapping.dtg.h",

Suggestion:

  "compiler/machine_mapping/machine_mapping.dtg.h",

lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 29 at r6 (raw file):

[[fields]]
name = "cached_subgraph_results"
type = "::FlexFlow::MachineMappingCache"

What about moving MachineMappingCache out of this class so MachineMappingContext can be passed as const & and MachineMappingCache alone gets passed as mutable &?


lib/compiler/include/compiler/machine_mapping/machine_mapping_result.h line 14 at r6 (raw file):

MachineMappingResult get_infinity_machine_mapping_result();

void minimize_runtime(MachineMappingResult &m1, MachineMappingResult const &m2);

Why this vs just calling min or something?

Code quote:

void minimize_runtime(MachineMappingResult &m1, MachineMappingResult const &m2);

lib/compiler/include/compiler/machine_mapping/machine_mapping_result.struct.toml line 9 at r5 (raw file):

includes = [
  "machine_mapping.dtg.h",

Suggestion:

  "compiler/machine_mapping/machine_mapping.dtg.h",

lib/compiler/src/unity_algorithm.cc line 39 at r6 (raw file):

  MachineMappingCache cached_subgraph_costs;
  DeduplicatedPriorityQueue<GraphOptimizeState> candidates;

Would it be cleaner here to, rather than having a custom GraphOptimizeState type whose hash and operator== ignore the runtime cost, instead have this be a DeduplicatedPriorityQueue<SomeWrapperAroundAPCG, CustomComparator> where the CustomComparator is something like below, where MyCustomType is analagous to SomeWrapperAroundAPCG, and I suppose instead of a std::unordered_map<MyCustomType, float> you'd have a MachineMappingCache

Code snippet:

#include <map>
#include <queue>
#include <unordered_map>
#include <stdlib.h>
#include <iostream>

int guid_ctr = 0; 

struct MyCustomType {
  MyCustomType()
    : guid(guid_ctr++) {}

  bool operator==(MyCustomType const &other) const {
    return this->guid == other.guid;
  }

  bool operator!=(MyCustomType const &other) const {
    return this->guid != other.guid;
  }

  int guid;
};

namespace std {

template <>
struct hash<MyCustomType> {
  size_t operator()(MyCustomType const &x) const {
    return x.guid;
  }
};

}

float get_cost(MyCustomType const &) {
  return rand();
}

int main() {
  std::unordered_map<MyCustomType, float> costs;

  auto cached_get_cost = [&](MyCustomType const &x) -> float {
    if (costs.find(x) == costs.end()) {
      costs.insert({x, get_cost(x)});
    }

    return costs.at(x);
  };

  auto my_custom_comparator = [&](MyCustomType const &lhs, MyCustomType const &rhs) -> bool {
    return cached_get_cost(lhs) < cached_get_cost(rhs);
  };

  std::priority_queue<MyCustomType, std::vector<MyCustomType>, decltype(my_custom_comparator)> q{my_custom_comparator};

  for (int i = 0; i < 10; i++) {
    q.push(MyCustomType{});
  }

  while (!q.empty()) {
    MyCustomType curr = q.top();
    q.pop();

    std::cout << cached_get_cost(curr) << std::endl;
  }
}

lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 81 at r6 (raw file):

  return enumeration;
}

Move to a separate allowed_machine_mappings file, this file is already getting rather complicated/large

Code quote:

std::vector<std::unordered_map<Node, MachineView>>
    allowed_machine_mappings(MachineMappingContext const &context,
                             std::unordered_set<Node> const &nodes,
                             MachineSpecification const &resource) {
  if (nodes.empty()) {
    return {{}};
  }
  Node node = get_first(nodes);
  std::vector<std::unordered_map<Node, MachineView>> partial_enumeration =
      allowed_machine_mappings(context, set_minus(nodes, {node}), resource);
  std::unordered_set<MachineView> allowed_machine_views_for_node =
      context.allowed_machine_views(context.pcg.raw_graph.at(node), resource);
  std::vector<std::unordered_map<Node, MachineView>> enumeration;
  for (MachineView const &mv : allowed_machine_views_for_node) {
    for (std::unordered_map<Node, MachineView> const &partial :
         partial_enumeration) {
      enumeration.push_back(merge_maps(
          partial, std::unordered_map<Node, MachineView>{{node, mv}}));
    }
  }
  return enumeration;
}

std::vector<std::unordered_map<DataflowOutput, MachineView>>
    allowed_machine_mappings(MachineMappingContext const &context,
                             std::unordered_set<DataflowOutput> const &values,
                             MachineSpecification const &resource) {
  std::unordered_set<Node> nodes;
  for (DataflowOutput const &v : values) {
    nodes.insert(v.node);
  }

  std::vector<std::unordered_map<Node, MachineView>> node_enumeration =
      allowed_machine_mappings(context, nodes, resource);
  std::vector<std::unordered_map<DataflowOutput, MachineView>> enumeration;

  for (std::unordered_map<Node, MachineView> _node_enumeration :
       node_enumeration) {
    std::unordered_map<DataflowOutput, MachineView> _emumeration;
    for (DataflowOutput const &v : values) {
      _emumeration.emplace(v, _node_enumeration.at(v.node));
    }
    enumeration.push_back(_emumeration);
  }

  return enumeration;
}

lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 187 at r6 (raw file):

  GraphSplit graph_split = get_graph_split(decompn1, decompn2);

  OpenDataflowSubgraphResult subgraph_res1 = get_subgraph(

Is there really a point to the subgraphs at this point, or can we just rely on the SP decomposition for this information? I guess it's that the graph cost is cached, and if you cached it for the SP decomposition then you wouldn't be agnostic to different locations of the same graph?


lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 192 at r6 (raw file):

      sub_pcg_from_full_pcg(context.pcg).raw_graph, graph_split.second);

  std::unordered_set<DataflowOutput> split_outputs = transform(

FYI you can use parallel_tensor_guid_t for this once #1471 is merged

Code quote:

DataflowOutput

lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 199 at r6 (raw file):

           &split_machine_views :
       allowed_machine_mappings(context, split_outputs, resource)) {
    std::unordered_map<OpenDataflowValue, MachineView> fixed_machine_views1 =

FYI you can use open_parallel_tensor_guid_t for this once #1471 is merged

Code quote:

OpenDataflowValue

lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 206 at r6 (raw file):

                      get_open_dataflow_values(subgraph_res2.graph));

    for (auto const &[split_value, split_input] :

Suggestion:

    for (auto const &[full_graph_value, subgraph_input] :

lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 283 at r6 (raw file):

    assert(contains(
        context.allowed_machine_views(context.pcg.raw_graph.at(node), resource),
        fixed_machine_views.at(any_output)));

Simplify

Code quote:

    assert(contains(
        context.allowed_machine_views(context.pcg.raw_graph.at(node), resource),
        fixed_machine_views.at(any_output)));

lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 302 at r6 (raw file):

                                   return OpenDataflowValue(o);
                                 }),
                       [&](OpenDataflowValue const &o) { return mv; });

Simplify, probably separate into two expressions

Code quote:

      std::unordered_map<OpenDataflowValue, MachineView> output_mv_map =
          generate_map(transform(get_outputs(context.pcg.raw_graph, node),
                                 [](DataflowOutput const &o) {
                                   return OpenDataflowValue(o);
                                 }),
                       [&](OpenDataflowValue const &o) { return mv; });

lib/compiler/src/compiler/machine_mapping/machine_mapping_cache.cc line 10 at r6 (raw file):

  if (contains_key(cache, state)) {
    MachineMappingResult result = cache.at(state);
    return std::make_optional(result);

Suggestion:

    return result;

lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 100 at r6 (raw file):

    }();

    ParallelComputationGraph pcg_non_sp = [&] {

These should be in the SUBCASEs as they're not shared between them


lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 146 at r6 (raw file):

      CHECK(bool(result.runtime > 0));
      // TODO(@Mengdi Wu): fill it with actual cost
      // CHECK(result.runtime == xx);

Rather than running it, getting a cost, and then just hardcoding that into the file, I'd focus on generating testcases (such as restricting the allowed machine views to only a few, controlling the cost function, etc.) where you can actually set analytically derived values


lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 171 at r6 (raw file):

    }

    SUBCASE("PCG is not sp-izable due to multiple inputs") {

You'd want separate testcases for this too that actually check the graph manipulation, not just get_optimal_machine_mapping-level tests


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 0 at r6 (raw file):
Rename machine_mapping.cc


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 18 at r6 (raw file):

    MachineMapping result = combine(machine_mapping_0, machine_mapping_1);
    CHECK(result == combined);
  }

Suggestion:

  TEST_CASE("combine(MachineMapping, MachineMapping)") {
    MachineView machine_view_0 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(1));
    MachineView machine_view_1 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(2));
    
    MachineMapping machine_mapping_0 = MachineMapping{{
      {Node(0), machine_view_0},
    }};
    MachineMapping machine_mapping_1 = MachineMapping{{
      {Node(1), machine_view_1},
    }};
    
    MachineMapping correct = MachineMapping{{
        {Node(0), machine_view_0}, 
        {Node(1), machine_view_1},
    }};
    
    MachineMapping result = combine(machine_mapping_0, machine_mapping_1);
    
    CHECK(result == correct);
  }

lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 20 at r6 (raw file):

  }

  TEST_CASE("nodes_are_disjoint") {

Suggestion:

  TEST_CASE("nodes_are_disjoint(MachineMapping, MachineMapping)") {

lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 28 at r6 (raw file):

        {{Node(0), machine_view_0}, {Node(1), machine_view_1}});
    CHECK(nodes_are_disjoint(machine_mapping_0, machine_mapping_1));
    CHECK_FALSE(nodes_are_disjoint(machine_mapping_0, combined));

Suggestion:

    MachineMapping machine_mapping_0({{Node(0), machine_view_0}});
    
    SUBCASE("nodes are disjoint") {
      MachineMapping machine_mapping_1({{Node(1), machine_view_1}});
      
      bool correct = true;
      bool result = nodes_are_disjoint(machine_mapping_0, machine_mapping_1);
      
      CHECK(result == correct);
    }
    
    SUBCASE("nodes are not disjoint") {
      MachineMapping machine_mapping_1(
        {{Node(0), machine_view_0}, {Node(1), machine_view_1}});
        
      bool correct = false;
      bool result = nodes_are_disjoint(machine_mapping_0, machine_mapping_1);
      
      CHECK(result == correct);
    }

lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 11 at r6 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("machine_mapping_cache") {

Suggestion:

  TEST_CASE("MachineMappingCache") {

lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_result.cc line 77 at r6 (raw file):

    CHECK(s1.runtime < inf.runtime);
    CHECK(s2.runtime < inf.runtime);
  }

Delete

Code quote:

  TEST_CASE("get_infinity_machine_mapping_result") {
    MachineView machine_view_0 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(1));
    MachineView machine_view_1 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(2));
    MachineMapping machine_mapping_empty(
        std::unordered_map<Node, MachineView>{});
    MachineMapping machine_mapping_0({{Node(0), machine_view_0}});
    MachineMapping machine_mapping_1({{Node(1), machine_view_1}});
    MachineMapping combined(
        {{Node(0), machine_view_0}, {Node(1), machine_view_1}});
    MachineMappingResult s0(0, machine_mapping_empty);
    MachineMappingResult s1(1, machine_mapping_0);
    MachineMappingResult s2(2, machine_mapping_1);

    MachineMappingResult inf = get_infinity_machine_mapping_result();
    CHECK(s0.runtime < inf.runtime);
    CHECK(s1.runtime < inf.runtime);
    CHECK(s2.runtime < inf.runtime);
  }

lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r6 (raw file):

SubParallelComputationGraph get_pcg_subgraph(ParallelComputationGraph const &,
                                             std::unordered_set<Node> const &);

Suggestion:

                                             std::unordered_set<parallel_layer_guid_t> const &);

lib/compiler/include/compiler/machine_mapping/machine_mapping.struct.toml line 17 at r6 (raw file):

  "utils/hash/unordered_map.h",
  "utils/fmt/unordered_map.h", 
]

Suggestion:

includes = [
  "utils/graph/node/node.dtg.h",
  "pcg/machine_view.dtg.h",
]  
 
src_includes = [   
  "utils/hash/unordered_map.h",
  "utils/fmt/unordered_map.h",
]

lib/compiler/include/compiler/machine_mapping/machine_mapping.struct.toml line 21 at r6 (raw file):

[[fields]]
name = "machine_views"
type = "std::unordered_map<::FlexFlow::Node, ::FlexFlow::MachineView>"

Suggestion:

type = "std::unordered_map<::FlexFlow::parallel_layer_guid_t, ::FlexFlow::MachineView>"

Copy link
Collaborator Author

@wmdi wmdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 38 of 41 files reviewed, 31 unresolved discussions (waiting on @lockshaw)


lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 6 at r6 (raw file):

#include "machine_mapping.h"
#include "machine_mapping_cache.h"
#include "machine_mapping_context.dtg.h"

Done.


lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 16 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This isn't going to allow access to the input and output tensor shapes of the operator, which are probably necessary?

I see. Will fix it when merging the machine view PR.


lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 23 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What is this overload for? Why does it not take a SerialParallelDecomposition?

This is to further split the logics into multiple pieces. get_optimal_machine_mapping wraps global variables as MachineMappingContext and calls this function, and this function performs sp decomposition (and solves the un-sp-izable issue in a later version) and calls other overloads.


lib/compiler/include/compiler/machine_mapping/machine_mapping.h line 4 at r6 (raw file):

#define _FLEXFLOW_COMPILER_MACHINE_MAPPING_H

#include "machine_mapping.dtg.h"

Done.


lib/compiler/include/compiler/machine_mapping/machine_mapping.h line 8 at r6 (raw file):

namespace FlexFlow {

MachineMapping combine(MachineMapping const &, MachineMapping const &);

Done.


lib/compiler/include/compiler/machine_mapping/machine_mapping_cache.h line 10 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Does it make sense to have a separate class/object for this, or this not just the std::unordered_map API?

I think it makes sense to wrap up the save and load logic as they handle some existence cases.


lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 7 at r5 (raw file):

includes = [
  "machine_mapping.dtg.h",

Done.


lib/compiler/include/compiler/machine_mapping/machine_mapping_result.h line 14 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why this vs just calling min or something?

This function only compares MachineMappingResults according to the runtime. It does not make sense to have an operator< for MachineMappingResult that only compares runtime, which is required by min.


lib/compiler/include/compiler/machine_mapping/machine_mapping_result.struct.toml line 9 at r5 (raw file):

includes = [
  "machine_mapping.dtg.h",

Done.


lib/compiler/src/compiler/machine_mapping/machine_mapping_cache.cc line 10 at r6 (raw file):

  if (contains_key(cache, state)) {
    MachineMappingResult result = cache.at(state);
    return std::make_optional(result);

Done.


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 18 at r6 (raw file):

    MachineMapping result = combine(machine_mapping_0, machine_mapping_1);
    CHECK(result == combined);
  }

Done.


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 20 at r6 (raw file):

  }

  TEST_CASE("nodes_are_disjoint") {

Done.


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 28 at r6 (raw file):

        {{Node(0), machine_view_0}, {Node(1), machine_view_1}});
    CHECK(nodes_are_disjoint(machine_mapping_0, machine_mapping_1));
    CHECK_FALSE(nodes_are_disjoint(machine_mapping_0, combined));

Done.


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Rename machine_mapping.cc

What does this mean?


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 11 at r6 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("machine_mapping_cache") {

Done.


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_result.cc line 77 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/compiler/include/compiler/machine_mapping/machine_mapping.struct.toml line 17 at r6 (raw file):

  "utils/hash/unordered_map.h",
  "utils/fmt/unordered_map.h", 
]

Done.

Copy link
Collaborator Author

@wmdi wmdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 38 of 41 files reviewed, 31 unresolved discussions (waiting on @lockshaw)


lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 29 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

What about moving MachineMappingCache out of this class so MachineMappingContext can be passed as const & and MachineMappingCache alone gets passed as mutable &?

I think MachineMappingCache should be part of the context, and it is fine to have a mutable context.


lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 81 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Move to a separate allowed_machine_mappings file, this file is already getting rather complicated/large

Done.

Copy link
Collaborator Author

@wmdi wmdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 38 of 41 files reviewed, 31 unresolved discussions (waiting on @lockshaw)


lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 206 at r6 (raw file):

                      get_open_dataflow_values(subgraph_res2.graph));

    for (auto const &[split_value, split_input] :

Done.


lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 283 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Simplify

Done.


lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 302 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Simplify, probably separate into two expressions

Done.


lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 100 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

These should be in the SUBCASEs as they're not shared between them

Done.


lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 146 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Rather than running it, getting a cost, and then just hardcoding that into the file, I'd focus on generating testcases (such as restricting the allowed machine views to only a few, controlling the cost function, etc.) where you can actually set analytically derived values

In terms of actual cost, I mean the cost model we have for a pcg. Even for analytical results, we need a precise cost model.


lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 171 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

You'd want separate testcases for this too that actually check the graph manipulation, not just get_optimal_machine_mapping-level tests

Done.


lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r6 (raw file):

SubParallelComputationGraph get_pcg_subgraph(ParallelComputationGraph const &,
                                             std::unordered_set<Node> const &);

Why?

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 37 files at r4, 7 of 17 files at r7, all commit messages.
Reviewable status: 33 of 43 files reviewed, 24 unresolved discussions (waiting on @wmdi)


lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h line 12 at r7 (raw file):

    allowed_machine_mappings(MachineMappingContext const &context,
                             std::unordered_set<Node> const &nodes,
                             MachineSpecification const &resource);

Suggestion:

std::vector<std::unordered_map<parallel_layer_guid_t, MachineView>>
    allowed_machine_mappings(MachineMappingContext const &context,
                             std::unordered_set<parallel_layer_guid_t> const &nodes,
                             MachineSpecification const &resource);

lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h line 16 at r7 (raw file):

std::vector<std::unordered_map<DataflowOutput, MachineView>>
    allowed_machine_mappings(MachineMappingContext const &context,
                             std::unordered_set<DataflowOutput> const &values,

Suggestion:

                            std::unordered_set<parallel_tensor_guid_t> const &values,

lib/compiler/test/src/test_graph_optimize_state.cc line 70 at r3 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

This will be fixed in another PR.

Isn't this a trivial fix?


lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 146 at r6 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

In terms of actual cost, I mean the cost model we have for a pcg. Even for analytical results, we need a precise cost model.

In a way yes, but you control the cost model: you can add a custom implementation of CostEstimator that returns whatever numbers you think would be useful for testing. We absolutely do want to (at some point) test the whole compiler+costestimator+machineviews+everything stack, but even if all of that was present we would also want to have tests that just that compiler is correct in isolation, which is what most of these tests should be--provide an arbitrary tiny set of allowed machine views (say, two of them), create a tiny PCG (with, say, 2 operators in parallel), and write a CostEstimator subclass that gives you back some chosen constants (1 for each of the possible node-machineview assignments). Then solve for what the resulting value of get_optimal_machine_mappings should be, run the function, and check that the returned value is the same as what you solved analytically.

At this point there shouldn't be anything stopping you from testing pretty much the whole Unity DP in isolation, and once everything is merged we'll add a couple end-to-end tests just to make sure there weren't any unexpected issues at the boundaries between components


lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 12 at r7 (raw file):

    auto allowed_machine_views1 = [&](ParallelLayerAttrs const &,
                                      MachineSpecification const &) {
      // TODO(@Mengdi Wu): Replace it with actual allowed machine views when

Ideally don't do this--these tests are supposed to confirm that get_optimal_machine_mapping is correct, not that get_allowed_machine_views is correct. If you call get_allowed_machine_views in this test and the test fails, you don't know whether it is because get_optimal_machine_mapping is wrong or if it s because get_allowed_machine_views is wrong


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line at r6 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

What does this mean?

Rename from test_machine_mapping.cc to machine_mapping.cc--the current convention is to not have the test_ prefix, as they're already under the test directory


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 25 at r7 (raw file):

        combine_disjoint_mappings(machine_mapping_0, machine_mapping_1);
    CHECK(result == correct);
  }

Add a SUBCASE to test the failure case

Suggestion:

  TEST_CASE("combine_disjoint_mappings(MachineMapping, MachineMappping)") {
    MachineView machine_view_0 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(1));
    MachineView machine_view_1 = make_1d_machine_view(gpu_id_t(0), gpu_id_t(2));
    
    SUBCASE("input machine mappings are disjoint") {
      MachineMapping machine_mapping_0 = MachineMapping({
          {Node(0), machine_view_0},
      });
      MachineMapping machine_mapping_1 = MachineMapping({
          {Node(1), machine_view_1},
      });
      MachineMapping correct = MachineMapping({
          {Node(0), machine_view_0},
          {Node(1), machine_view_1},
      });
      MachineMapping result =
          combine_disjoint_mappings(machine_mapping_0, machine_mapping_1);
      CHECK(result == correct);
    }
    
    SUBCASE("input machine mappings are not disjoint") {
      ...
    }
  }

lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r6 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

Why?

parallel_layer_guid_t is a wrapper for Node that we use for representing Nodes in a ParallelComputationGraph. It primarily improves readability and helps prevent accidentally passing the wrong node to things

Copy link
Collaborator

@lockshaw lockshaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 10 of 17 files at r7.
Reviewable status: all files reviewed, 28 unresolved discussions (waiting on @wmdi)


lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h line 10 at r7 (raw file):

std::vector<std::unordered_map<Node, MachineView>>
    allowed_machine_mappings(MachineMappingContext const &context,

The type signatures of these methods are a bit odd? At least it's not 100% clear for me what they're doing based on the names and input types, especially since there are two overloads? Can you either clarify the names a bit and/or add doxygen docstrings giving a brief description of what they do?


lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 23 at r6 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

This is to further split the logics into multiple pieces. get_optimal_machine_mapping wraps global variables as MachineMappingContext and calls this function, and this function performs sp decomposition (and solves the un-sp-izable issue in a later version) and calls other overloads.

I think it's probably better to think of the SP decomposition as coming from outside of the optimal machine mapping algorithm, as with Pietro's stuff it at least seems more like creating an SP-ization of a given PCG is a separate step than just part of optimal machine mapping algorithm, but this isn't a bit deal and can always get refactored later. That said, it seems a bit odd to have this function, which conceptually seems to do entirely different things from the other get_optimal_machine_mapping_internal overloads, have the same name as them, so if you do prefer to use this structure it should at least be renamed to something clearer--ideally I should be able to make a pretty accurate educated guess of what a function does from its name and arguments, and I don't think that's the case here.


lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 29 at r6 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

I think MachineMappingCache should be part of the context, and it is fine to have a mutable context.

Is there a significant downside to moving it out? It would make the code a bit more readable as it would make it more obvious what gets mutated and what doesn't


lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 9 at r7 (raw file):

namespace FlexFlow {

std::vector<std::unordered_map<Node, MachineView>>

Might be cleaner to define a machine_mapping.struct.toml wrapper type around std::unordered_map<parallel_layer_guid_t, MachineView> so this turns into the more readable std::vector<MachineMapping>? Also why is this a std::vector and not a std::unordered_set/std::set? Should there be duplicates or does order matter?


lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 14 at r7 (raw file):

                             MachineSpecification const &resource) {
  if (nodes.empty()) {
    return {{}};

Why not an empty vector as a return value? Does an empty machine mapping really have a meaning?


lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 29 at r7 (raw file):

    }
  }
  return enumeration;

Suggestion:

  Node curr_node = get_first(nodes);
  std::unordered_set<Node> other_nodes = set_minus(nodes, {node});
  
  std::vector<std::unordered_map<Node, MachineView>> other_node_mappings_from_recursion =
      allowed_machine_mappings(context, other_nodes, resource);
      
  std::unordered_set<MachineView> allowed_machine_views_for_curr_node =
      context.allowed_machine_views(context.pcg.raw_graph.at(node), resource);
      
  std::vector<std::unordered_map<Node, MachineView>> result;
  for (MachineView const &for_curr_node : allowed_machine_views_for_curr_node) {
    for (std::unordered_map<Node, MachineView> const &for_other_nodes :
         other_node_mappings_from_recursion) {
           
      MachineMapping merged = merge_disjoint_machine_mappings(
        MachineView{{
          {curr_node, for_cudd_node},
        }},
        for_other_nodes,
      });
      
      enumeration.push_back(merged);
    }
  }
  return result;

lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 39 at r7 (raw file):

  for (DataflowOutput const &v : values) {
    nodes.insert(v.node);
  }

Suggestion:

  std::unordered_set<Node> nodes transform(values, [](DataflowOutput const &v) { return v.node; });

lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 47 at r7 (raw file):

  for (std::unordered_map<Node, MachineView> _node_enumeration :
       node_enumeration) {
    std::unordered_map<DataflowOutput, MachineView> _emumeration;

Don't use leading underscore, if you need to distinguish it from enumeration it's probably better to come up with a clearer name as enumeration is a bit unclear (what is being enumerated? does this have to do with enumerate from utils/containers/enumerate.h?)

Code quote:

_emumeration;

lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 257 at r7 (raw file):

      std::unordered_map<OpenDataflowValue, MachineView> output_mv_map =
          generate_map(outputs_of_node,
                       [&](OpenDataflowValue const &o) { return mv; });

Suggestion:

                     [&](OpenDataflowValue const &) { return mv; });

lib/compiler/test/src/test_graph_optimize_state.cc line 69 at r3 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

This will be fixed in another PR.

In that case can you create an issue for that and link it here?


lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 171 at r6 (raw file):

Previously, wmdi (Mengdi Wu) wrote…

Done.

I just see a comment added? Are we planning to add the multiple-input handling in this PR or in a separate one?


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 12 at r7 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("MachineMappingCache") {
    ParallelComputationGraph pcg = [&] {

Why create a PCG here? It looks like the tests don't actually need it--they just need an SPDecomposition object?


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 58 at r7 (raw file):

    MachineSpecification machine_spec(1, 1, 1, 1, 1);
    MachineMappingState state0(subgraph0, machine_spec, {});

Prefer curly-brace initialization for consistency with the rest of the FF codebase

Suggestion:

    MachineMappingState state0 = MachineMappingState{
      subgraph0, machine_spec, {}};

lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 72 at r7 (raw file):

    cache.save(state0, result0);
    CHECK(cache.load(state0).value() == result0);

Break into SUBCASEs with titles that explain what you're testing--right now it's just a sea of checks which makes this test rather difficult to understand


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 81 at r7 (raw file):

    CHECK(cache.load(state2) == std::nullopt);

    cache.save(state2, result2);

Include failure cases--what happens if I save a key twice?


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_result.cc line 21 at r7 (raw file):

    MachineMappingResult s2(2, machine_mapping_1);

    MachineMappingResult result0 = sequential_combine(s0, s1);

Each of these checks/function evaluations should have a SUBCASE with a title explaining what it's testing/why it's there. Right now it's just a bunch of checks, which makes it difficult to understand quickly what high-level properties you're trying to verify with each. If you can't come up with a concise property that's being verified, the test might need some restructuring/redesigning

(and same with the parallel_combine testcase below, and the minimize_runtime testcase below that)

For an example of what I mean, see https://github.com/flexflow/FlexFlow/blob/add-candle-uno-pcg/lib/op-attrs/test/src/op-attrs/ops/concat.cc#L175-L312--each of the failure cases and success cases has a separate SUBCASE with a title indicating what is being checked, or for a simpler example, https://github.com/flexflow/FlexFlow/blob/add-candle-uno-pcg/lib/utils/test/src/utils/containers/try_merge_nondisjoint_unordered_maps.cc#L9-L75


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_result.cc line 23 at r7 (raw file):

    MachineMappingResult result0 = sequential_combine(s0, s1);
    CHECK(result0.runtime == 1);
    CHECK(result0.machine_mapping == machine_mapping_0);

Suggestion:

    MachineMappingResult result0 = sequential_combine(s0, s1);
    MachineMappingResult correct = MachineMappingResult{machine_mapping_0, 1};

Copy link
Collaborator Author

@wmdi wmdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 28 unresolved discussions (waiting on @lockshaw)


lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h line 12 at r7 (raw file):

    allowed_machine_mappings(MachineMappingContext const &context,
                             std::unordered_set<Node> const &nodes,
                             MachineSpecification const &resource);

Done.


lib/compiler/include/compiler/machine_mapping/allowed_machine_mappings.h line 16 at r7 (raw file):

std::vector<std::unordered_map<DataflowOutput, MachineView>>
    allowed_machine_mappings(MachineMappingContext const &context,
                             std::unordered_set<DataflowOutput> const &values,

Done.


lib/compiler/include/compiler/machine_mapping/get_optimal_machine_mapping.h line 23 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I think it's probably better to think of the SP decomposition as coming from outside of the optimal machine mapping algorithm, as with Pietro's stuff it at least seems more like creating an SP-ization of a given PCG is a separate step than just part of optimal machine mapping algorithm, but this isn't a bit deal and can always get refactored later. That said, it seems a bit odd to have this function, which conceptually seems to do entirely different things from the other get_optimal_machine_mapping_internal overloads, have the same name as them, so if you do prefer to use this structure it should at least be renamed to something clearer--ideally I should be able to make a pretty accurate educated guess of what a function does from its name and arguments, and I don't think that's the case here.

I see. But I don't think it is a problem to have this signature. First, it is only used internally, so it has little effect on other parts. And it actually make the internal implementation more convenient. Second, it actually has the similar functionality as other functions that have the same name but just different format of the inputs. We should name functions according to their functionality instead of the detailed implementation.


lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 29 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is there a significant downside to moving it out? It would make the code a bit more readable as it would make it more obvious what gets mutated and what doesn't

I feel like we should put global variables in MachineMappingContext as possible and only explicitly pass the variables that belong to DP states. I think it is more clear this way.


lib/compiler/src/unity_algorithm.cc line 39 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would it be cleaner here to, rather than having a custom GraphOptimizeState type whose hash and operator== ignore the runtime cost, instead have this be a DeduplicatedPriorityQueue<SomeWrapperAroundAPCG, CustomComparator> where the CustomComparator is something like below, where MyCustomType is analagous to SomeWrapperAroundAPCG, and I suppose instead of a std::unordered_map<MyCustomType, float> you'd have a MachineMappingCache

Sounds a good idea and we will leave all the MCMC search part to another PR.


lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 9 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might be cleaner to define a machine_mapping.struct.toml wrapper type around std::unordered_map<parallel_layer_guid_t, MachineView> so this turns into the more readable std::vector<MachineMapping>? Also why is this a std::vector and not a std::unordered_set/std::set? Should there be duplicates or does order matter?

It is a vector because no deduplication is required here.


lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 14 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not an empty vector as a return value? Does an empty machine mapping really have a meaning?

It means the only machine mapping for an empty set of input nodes.


lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 29 at r7 (raw file):

    }
  }
  return enumeration;

Done.


lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 39 at r7 (raw file):

  for (DataflowOutput const &v : values) {
    nodes.insert(v.node);
  }

I don't think we should use transform since it is not a one-to-one mapping but a many-to-one mapping. We should not expect transform to handle many-to-one mapping not matter whether it actually does.


lib/compiler/src/compiler/machine_mapping/allowed_machine_mappings.cc line 47 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Don't use leading underscore, if you need to distinguish it from enumeration it's probably better to come up with a clearer name as enumeration is a bit unclear (what is being enumerated? does this have to do with enumerate from utils/containers/enumerate.h?)

Done.


lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 187 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is there really a point to the subgraphs at this point, or can we just rely on the SP decomposition for this information? I guess it's that the graph cost is cached, and if you cached it for the SP decomposition then you wouldn't be agnostic to different locations of the same graph?

Done.


lib/compiler/src/compiler/machine_mapping/get_optimal_machine_mapping.cc line 257 at r7 (raw file):

      std::unordered_map<OpenDataflowValue, MachineView> output_mv_map =
          generate_map(outputs_of_node,
                       [&](OpenDataflowValue const &o) { return mv; });

Done.


lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 146 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

In a way yes, but you control the cost model: you can add a custom implementation of CostEstimator that returns whatever numbers you think would be useful for testing. We absolutely do want to (at some point) test the whole compiler+costestimator+machineviews+everything stack, but even if all of that was present we would also want to have tests that just that compiler is correct in isolation, which is what most of these tests should be--provide an arbitrary tiny set of allowed machine views (say, two of them), create a tiny PCG (with, say, 2 operators in parallel), and write a CostEstimator subclass that gives you back some chosen constants (1 for each of the possible node-machineview assignments). Then solve for what the resulting value of get_optimal_machine_mappings should be, run the function, and check that the returned value is the same as what you solved analytically.

At this point there shouldn't be anything stopping you from testing pretty much the whole Unity DP in isolation, and once everything is merged we'll add a couple end-to-end tests just to make sure there weren't any unexpected issues at the boundaries between components

Done.


lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 171 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I just see a comment added? Are we planning to add the multiple-input handling in this PR or in a separate one?

In a separate one.


lib/compiler/test/src/compiler/machine_mapping/test_get_optimal_machine_mapping.cc line 12 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Ideally don't do this--these tests are supposed to confirm that get_optimal_machine_mapping is correct, not that get_allowed_machine_views is correct. If you call get_allowed_machine_views in this test and the test fails, you don't know whether it is because get_optimal_machine_mapping is wrong or if it s because get_allowed_machine_views is wrong

Done.


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Rename from test_machine_mapping.cc to machine_mapping.cc--the current convention is to not have the test_ prefix, as they're already under the test directory

Done.


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping.cc line 25 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add a SUBCASE to test the failure case

Done.


lib/compiler/test/src/compiler/machine_mapping/test_machine_mapping_cache.cc line 12 at r7 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why create a PCG here? It looks like the tests don't actually need it--they just need an SPDecomposition object?

Yes, this PCG is only used to create the SPD


lib/substitutions/include/substitutions/sub_parallel_computation_graph.h line 25 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

parallel_layer_guid_t is a wrapper for Node that we use for representing Nodes in a ParallelComputationGraph. It primarily improves readability and helps prevent accidentally passing the wrong node to things

Done.


lib/compiler/include/compiler/machine_mapping/machine_mapping.struct.toml line 21 at r6 (raw file):

[[fields]]
name = "machine_views"
type = "std::unordered_map<::FlexFlow::Node, ::FlexFlow::MachineView>"

Done.

@lockshaw lockshaw self-assigned this Sep 26, 2024
@lockshaw lockshaw added the repo-refactor Topics related to the repo and search refactors label Sep 26, 2024
@lockshaw lockshaw marked this pull request as draft September 26, 2024 04:56
Copy link

@Marsella8 Marsella8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r9, 106 of 150 files at r10, 117 of 145 files at r11, 51 of 60 files at r12, 23 of 23 files at r13, all commit messages.
Reviewable status: all files reviewed, 52 unresolved discussions (waiting on @wmdi)


lib/compiler/include/compiler/machine_mapping/include_unconstrained.struct.toml line 2 at r11 (raw file):

namespace = "FlexFlow"
name = "IncludeUnconstrained"

maybe enum instead? (just an idea)


lib/compiler/include/compiler/machine_mapping/machine_mapping_result.h line 9 at r12 (raw file):

namespace FlexFlow {

[[nodiscard]] MachineMappingResult infeasible_machine_mapping_result();

any strong reason for [[nodiscard]]?


lib/compiler/include/compiler/machine_mapping/machine_mapping_problem_tree/mm_problem_tree_parallel_split_label.struct.toml line 2 at r11 (raw file):

namespace = "FlexFlow"
name = "MMProblemTreeParallelSplitLabel"

why is it empty?


lib/compiler/src/compiler/machine_mapping/get_machine_resource_splits.cc line 27 at r12 (raw file):

    result.insert(std::make_pair(sub_resource2, sub_resource1));
  }

note that we are only partitioning across GPUs here.

Currently, I consider how MachineSpecification fuses GPU and CPU nodes to be a bit awkard (e.g. it doesn't even have a separate intra and inter node bandwitdh per parameters), I think it should instead have as attributes num_nodes, num_devices_per_node, device_type, inter_node_bandwidth, and intra_node_bandwidth, since this way we can ignore the CPU stuff a lot more and only use it when needed.

Also with the i*=2 in this part of the code feels a bit arbitrary?


lib/compiler/src/compiler/machine_mapping/get_tensor_set_movement_across_split.cc line 4 at r13 (raw file):

#include "compiler/machine_mapping/abstracted_tensor_set_movement/abstracted_tensor_set_movement.h"
#include "compiler/machine_mapping/abstracted_tensor_set_movement/get_abstracted_tensor_set_movement_across_split.h"
#include "compiler/machine_mapping/transitive_reduced_pcg.h"

some of this imports seem superfluous


lib/compiler/src/compiler/machine_mapping/machine_mapping_result.cc line 26 at r13 (raw file):

  for (MachineMappingResult const &candidate : candidates) {
    result = minimize_runtime(result, candidate);

optionally foldl, but it's honestly pretty clear as it is


lib/compiler/src/compiler/machine_mapping/machine_mapping_result.cc line 51 at r13 (raw file):

  ParallelLayerGuidObliviousMachineMapping mapping = [&] {
    if (parallel_split_transformation.has_value()

I think this should be if (parallel_split_transformation.has_value()) then evaluate whether it should be LthenR or RthenL, here if the split_transformation has no value it calls it with LthenR


lib/compiler/src/compiler/machine_mapping/machine_mapping_result.cc line 97 at r13 (raw file):

MachineMappingResult minimize_runtime(MachineMappingResult const &maybe_m1,
                                      MachineMappingResult const &maybe_m2) {
  FeasibleMachineMappingResult m1 = ({

This pattern to get the feasible results seems to appear fairly often here, maybe pacakge it into a separate function?


lib/compiler/src/compiler/machine_mapping/machine_mapping_result.cc line 111 at r13 (raw file):

  });

  if (m2.runtime < m1.runtime) {

ternary optionally


lib/compiler/test/src/compiler/machine_mapping/machine_mapping_problem_tree/get_machine_mapping_problem_tree.cc line 63 at r13 (raw file):

      UnmappedOpCostEstimateKey input_key = make_input_key(input_shape);

      PCGBinarySPDecomposition sp_decomposition = \

remove

Code quote:

\

lib/local-execution/src/ops/element_binary.h line 1 at r13 (raw file):

#ifndef _FLEXFLOW_ELEMENT_BINARY_H

change for all src/ops/ to also include the path? not a big deal

Code quote:

_FLEXFLOW_ELEMENT_BINARY_H

lib/local-execution/src/ops/layer_norm.h line 1 at r13 (raw file):

#ifndef _FLEXFLOW_RUNTIME_SRC_OPS_LAYER_NORM_H

LOCAL_EXECUTION

Code quote:

RUNTIME

lib/utils/include/utils/full_binary_tree/find_paths_to_leaf.h line 16 at r13 (raw file):

template <typename ParentLabel, typename LeafLabel>
std::unordered_set<BinaryTreePath> find_paths_to_leaf(FullBinaryTree<ParentLabel, LeafLabel> const &tree,

super clean implementation


lib/utils/include/utils/graph/digraph/algorithms/get_edges_from_subgraph_to_subgraph.h line 8 at r13 (raw file):

std::unordered_set<DirectedEdge> get_edges_from_subgraph_to_subgraph(DiGraphView const &,
                                                                     std::unordered_set<Node> const &,

add variable names for clarity


lib/utils/include/utils/graph/series_parallel/binary_sp_decomposition_tree/generic_binary_sp_decomposition_tree/make.h line 7 at r13 (raw file):

namespace FlexFlow {

maybe a couple of using SomeShorterName = GenericBinarySPDecompositionTree<SeriesLabel, ParallelLabel, LeafLabel> and same for the others?
I know it's not really standard for the codebase, but I think in this specific case it could help readability quite a bit.


lib/utils/include/utils/graph/series_parallel/binary_sp_decomposition_tree/leaf_only_binary_sp_decomposition_tree/leaf_only_binary_parallel_split_label.struct.toml line 2 at r13 (raw file):

namespace = "FlexFlow"
name = "LeafOnlyBinaryParallelSplitLabel"

why empty?


lib/utils/include/utils/graph/series_parallel/binary_sp_decomposition_tree/leaf_only_binary_sp_decomposition_tree/leaf_only_binary_series_split_label.struct.toml line 11 at r13 (raw file):

  "rapidcheck",
]

why empty?


lib/utils/src/utils/graph/dataflow_graph/algorithms/transitive_reduced_dataflow_graph/transitive_reduced_dataflow_graph.cc line 7 at r13 (raw file):

TransitiveReducedDataflowGraphView get_dataflow_graph_transitive_reduction(DataflowGraphView const &g) {
  DiGraphView as_digraph = g;

use an explicit to_digraph to cast it (just so this won't break in case we decide to modify the coercion rules)


lib/utils/test/src/utils/containers/unordered_map_from_pairs.cc line 42 at r13 (raw file):

        {1, "b"},
      };

I think we should throw an error in this case


lib/compiler/src/compiler/machine_mapping/get_allowed_machine_views_list.cc line 44 at r8 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Might be more readable as something like this? We'd need to add a transform_tuples function or something like it that takes in a container of std::tuple<A, B, ...> and a lambda matching std::function<T(A const &, B const &, ...)> and combines the destructuring and transform operations, but I'm sure I or @Marsella8 would be happy to add this to containers as I can think of quite a few places where it would be useful. Not a high priority and the current code is fine as is, just something I was thinking of

yeah seems like a good function to add, let me know if it is needed


lib/compiler/src/compiler/machine_mapping/machine_mapping_constraints.cc line 75 at r13 (raw file):

    } else {
      if (current_machine_view.value() != machine_view) {
        throw mk_runtime_error(fmt::format("with_additional_layer_machine_views received machine view assignment for layer {} "

with_additional_constraints

Code quote:

with_additional_layer_machine_views

Copy link
Collaborator Author

@wmdi wmdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 15 files at r1, 1 of 9 files at r3, 9 of 37 files at r4, 1 of 17 files at r7, 4 of 21 files at r8, 1 of 3 files at r9, 103 of 150 files at r10, 84 of 145 files at r11, 40 of 60 files at r12, 18 of 23 files at r13, 83 of 83 files at r14, all commit messages.
Reviewable status: all files reviewed, 52 unresolved discussions (waiting on @lockshaw)


lib/compiler/include/compiler/machine_mapping/machine_mapping_context.struct.toml line 29 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I agree up to the point of grouping mutable and non-mutable global variables into the same type as then the mutation properties become unclear (how do I know that your function isn't also mutating the pcg?), especially since only one of the members ever mutable. This is one case where I have a strong preference that we split MachineMappingCache out so that the function signatures remain clear about what is safe from mutation

Done.

@lockshaw lockshaw marked this pull request as ready for review October 5, 2024 18:34
lockshaw
lockshaw previously approved these changes Oct 5, 2024
@lockshaw lockshaw enabled auto-merge (squash) October 5, 2024 18:35
@lockshaw lockshaw self-requested a review October 8, 2024 04:28
@lockshaw lockshaw merged commit b77598c into flexflow:repo-refactor Oct 8, 2024
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
repo-refactor Topics related to the repo and search refactors
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants