
VLSU Support #183

Open · wants to merge 16 commits into master
Conversation

aarongchan (Collaborator)

PR Goal:
Implement VLSU instruction support. The current implementation reuses the LSU design and adds vector iterations based on VLEN and the data width of the VLSU.
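As a rough illustration of the iteration count mentioned above (a sketch only; VLEN and the data width are assumed to be in bits, and the names are not the actual members):

#include <cstdint>

// Sketch: one memory request ("pass") per data-width chunk of the vector register.
uint32_t numVectorIterations(uint32_t vlen_bits, uint32_t data_width_bits)
{
    // e.g. VLEN = 512 with a 64-bit wide VLSU datapath -> 8 passes
    return (vlen_bits + data_width_bits - 1) / data_width_bits;
}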

@aarongchan aarongchan added the enhancement New feature or request label Jul 18, 2024
@aarongchan aarongchan self-assigned this Jul 18, 2024
@aarongchan (Collaborator, Author)

@kathlenemagnus
When you get a chance, please take a look at VLSU.cpp (specifically completeInst_) for context on the rest of this comment. I've updated my code to work around how the LSU handles stores, but the issue I ran into and had to work around can be summarized as:

  • Stores get retired before "completing" in the LSU. This is an issue because:
    • We need to send multiple memory requests through. If one vector store needs 16 passes, i.e. 16 memory operations, we need to send that many requests through. However, the wrappers for LoadStoreInfo and MemoryAccessInfo, which contain data relevant to the cache, TLB, and MMU status, include the InstPtr. This becomes a problem with multiple requests, because the InstPtr already has a retired status, and each subsequent request will try to set the status again, resulting in an error.
    • The workaround I did was to create a separate status for each LoadStoreInfo wrapper, so each memory request has its own status that is separate from the instruction's (a rough sketch of this is at the end of this comment).
      A couple of questions:
  • Currently in my design, the memory requests are sent through the VLSU serially, so we have to wait for the first pass to finish before the second pass can begin for the same vector instruction. This is due to the current design of the LSU, which assumes only one memory request per instruction; that holds for the LSU, but not for the VLSU. The benefit of this is that if there are multiple VLSU instructions, they fully utilize the pipeline: while instruction A's memory request is waiting on the cache, instruction B's memory request is at a different stage of the VLSU pipeline. My question is, does this type of design make sense? Or should I rearchitect this to work differently, similar to the uop generator in the VLSU that Knute had suggested?
  • The VLSU uop generator idea would only process one VLSU instruction at a time, but generate, say, 16 memory requests that then queue up in the VLSU. The flow would look like:
vlsu_uop_generator -> generates 16 memory requests for instruction A
vlsu_ldst_queue_ -> takes the memory requests, queues them up, begins processing them through the pipeline
vlsu -> once all memory requests are through, we officially retire instruction A
(repeat for all vlsu instructions)
vlsu_uop_generator -> generates 16 memory requests for instruction B
...

I wanted to check all of this with you and Knute before proceeding, because it will change a bit of how the VLSU works compared to the LSU, especially around setting the instruction's status versus creating a per-LoadStoreInfo status, due to the nature of VLSU instructions.
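The rough sketch mentioned above for the per-request status workaround (a simplified fragment only; the class name, enum, and members are illustrative, not the actual LoadStoreInfo code):

// Sketch: each LoadStoreInfo-style wrapper carries its own status, so later passes of an
// already-retired vector instruction never touch the InstPtr's status again.
class LoadStoreInfoSketch
{
  public:
    enum class Status { READY, ISSUED, COMPLETED };

    void setVLSUStatus(Status status) { vlsu_status_ = status; }
    Status getVLSUStatus() const { return vlsu_status_; }

  private:
    Status vlsu_status_ = Status::READY; // status of this memory request only
};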

@@ -336,6 +336,9 @@ namespace olympia
// uint32_t unfusedInstsSize = insts->size();

// Decrement internal Uop Queue credits
ILOG(uop_queue_credits_)
Collaborator:

Did you mean to check these in?

core/Inst.hpp Outdated
@@ -485,6 +506,12 @@ namespace olympia
1; // start at 1 because the uop count includes the parent instruction
uint64_t uop_count_ = 0;
VCSRs VCSRs_;
uint32_t eew_;
Collaborator:

I would move these values into the VCSRs structure and rename it to something like VectorConfig or VectorState.

uint32_t mop_;
uint32_t stride_;

uint32_t vlsu_total_iters_ = 0;
@kathlenemagnus (Collaborator), Jul 27, 2024:

Also might want to consider a decorator like RenameData for LSUData.
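For illustration, the VectorConfig/VectorState suggestion might look roughly like this (a sketch only; the exact field set is an assumption based on the members shown in this diff):

// Sketch: fold the per-instruction vector state into one structure instead of
// keeping eew_, mop_, stride_, etc. as loose members on Inst.
struct VectorConfig
{
    uint32_t eew              = 0; // effective element width
    uint32_t mop              = 0; // memory addressing mode
    uint32_t stride           = 0; // stride for strided accesses
    uint32_t vlsu_total_iters = 0; // total passes the VLSU needs for this inst
};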

@@ -237,7 +240,7 @@ namespace olympia
"pipe. Did you define it in the yaml properly?");
// so we have a map here that checks for which valid dispatchers for that
// instruction target pipe map needs to be: "int": [exe0, exe1, exe2]
if (target_pipe != InstArchInfo::TargetPipe::LSU)
if (target_pipe != InstArchInfo::TargetPipe::LSU && target_pipe != InstArchInfo::TargetPipe::VLSU)
Collaborator:

Can you just check isLoadStoreInst() instead?

if(uop->isLoadStoreInst()){
    // set base address according to LMUL, i.e. if we're on the 3rd
    // LMUL uop, its base address should be base address + 3 * EEW
    uop->setTargetVAddr(uop->getTargetVAddr() + uop->getEEW() * uop->getUOpID());
Collaborator:

This is fine for now since this PR only supports unit stride, but eventually we should create an "address unroller" class that generates all of the addresses for a vector uop. I think it makes sense to be a part of this class, but it could also be a part of the VLSU.

}
}
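As a rough sketch of the "address unroller" idea mentioned in the comment above, unit stride only (function and parameter names are hypothetical):

#include <cstdint>
#include <vector>

// Sketch: generate every address a unit-stride vector uop will touch.
// eew_bytes and total_iters would come from the instruction's vector configuration.
std::vector<uint64_t> unrollUnitStrideAddresses(uint64_t base_vaddr,
                                                uint32_t eew_bytes,
                                                uint32_t total_iters)
{
    std::vector<uint64_t> vaddrs;
    vaddrs.reserve(total_iters);
    for (uint32_t i = 0; i < total_iters; ++i)
    {
        vaddrs.push_back(base_vaddr + static_cast<uint64_t>(i) * eew_bytes);
    }
    return vaddrs;
}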

void setVectorIter(uint32_t vec_iter){
Collaborator:

Open bracket should be on a new line.

private:
MemoryAccessInfoPtr mem_access_info_ptr_;
sparta::State<IssuePriority> rank_;
sparta::State<IssueState> state_;
bool in_ready_queue_;
uint32_t vector_iterations_ = 0;
Collaborator:

Is this really "completed vector iterations"? Could be renamed to be more clear.

core/LSU.cpp Outdated
@@ -258,17 +259,18 @@ namespace olympia
{
sparta_assert(inst_ptr->getStatus() == Inst::Status::RETIRED,
"Get ROB Ack, but the store inst hasn't retired yet!");
if(!inst_ptr->isVector()){
Collaborator:

You should assert if the inst is a vector since you have separate ports for scalar and vector.
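For example, something along these lines (the assertion message is just a placeholder):

// Fail loudly if a vector instruction ever reaches the scalar LSU, since the
// scalar and vector paths now use separate ports.
sparta_assert(!inst_ptr->isVector(),
              "Scalar LSU received a vector instruction; it should have gone to the VLSU");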

core/LSU.cpp Outdated
for (auto & inst_info_ptr : ldst_inst_queue_)
{
if (inst_info_ptr->getInstPtr() == inst_ptr)
if(!inst_ptr->isVector()){
Collaborator:

Same comment as above. Just assert if vector since vector insts should never end up here.

@@ -100,6 +100,13 @@ namespace olympia
return inst_ptr == nullptr ? 0 : inst_ptr->getUniqueID();
}

// This is a function which will be added in the SPARTA_ADDPAIRs API.
uint64_t getInstUOpID() const
Collaborator:

In what scenario is the inst ptr not set?

@@ -151,6 +158,8 @@ namespace olympia
replay_queue_iterator_ = iter;
}

void setIsVector(bool is_vector){ is_vector_ = is_vector; }
bool isVector(){ return is_vector_; }
@kathlenemagnus (Collaborator), Jul 27, 2024:

Can we check the inst ptr for isVector() instead?

core/ROB.cpp Outdated
@@ -73,19 +73,9 @@ namespace olympia
{
for (auto & i : *in_reorder_buffer_write_.pullData())
{
if (!i->isUOp())
Collaborator:

I changed this code in my most recent PR so you may have conflicts whenever you sync your fork.

@kathlenemagnus (Collaborator)

> The workaround I did was to create a separate status for each LoadStoreInfo wrapper, so each memory request has its own status that is separate from the instruction's.

I think this is the right solution. Each memory access should be completed separately and then a vector instruction can be marked completed if all of its memory accesses have completed.
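A minimal sketch of that completion rule, assuming the VLSU counts how many of an instruction's memory accesses have finished (helper and member names are hypothetical):

// Sketch: each memory access completes independently; the parent vector instruction
// is only marked completed once the last outstanding access has come back.
void onMemoryAccessComplete(const InstPtr & inst_ptr,
                            uint32_t & completed_accesses,
                            const uint32_t total_accesses)
{
    ++completed_accesses;
    if (completed_accesses == total_accesses)
    {
        // assumes a COMPLETED state analogous to the scalar flow
        inst_ptr->setStatus(Inst::Status::COMPLETED);
    }
}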

> Currently in my design, the memory requests are sent through the VLSU serially, so we have to wait for the first pass to finish before the second pass can begin for the same vector instruction. This is due to the current design of the LSU, which assumes only one memory request per instruction; that holds for the LSU, but not for the VLSU. The benefit of this is that if there are multiple VLSU instructions, they fully utilize the pipeline: while instruction A's memory request is waiting on the cache, instruction B's memory request is at a different stage of the VLSU pipeline. My question is, does this type of design make sense? Or should I rearchitect this to work differently, similar to the uop generator in the VLSU that Knute had suggested?

I would rearchitect like Knute suggested. What we want is for a vector instruction to be able to generate multiple memory accesses. They can be executed serially, but they should be able to be pipelined. So 1 vector instruction should be able to send a uop down the LSU pipeline every cycle. Similarly to how we used to track whether a parent instruction was "done" in the ROB, a vector load or store should expect multiple memory accesses to be completed before it can be marked as done. One thing to be careful of though is that there should only be 1 writeback to the vector destination. So you will need some sort of structure for collecting all of the data returned to the LSU.

> The VLSU uop generator idea would only process one VLSU instruction at a time, but generate, say, 16 memory requests that then queue up in the VLSU. The flow would look like:
>
> vlsu_uop_generator -> generates 16 memory requests for instruction A
> vlsu_ldst_queue_ -> takes the memory requests, queues them up, begins processing them through the pipeline
> vlsu -> once all memory requests are through, we officially retire instruction A
> (repeat for all vlsu instructions)
> vlsu_uop_generator -> generates 16 memory requests for instruction B

> I wanted to check all of this with you and Knute before proceeding, because it will change a bit of how the VLSU works compared to the LSU, especially around setting the instruction's status versus creating a per-LoadStoreInfo status, due to the nature of VLSU instructions.

Let's talk about this more on Monday. I see two paths forward here:

  1. We could do as you suggest and "sequence" the vector load store instruction when it gets to the LSU and generate a memory request that is stored in the LDST queue. These memory requests would behave like scalar instructions in the queue. The question here is when to trigger WB. It's inefficient to writeback to the vector destination 16 times and it is not always possible to do partial writes to the register file.
  2. Another option would be to keep the vector instruction in the LDST queue and send it down the LSU pipe multiple times. Each time it gets sent down the pipe it would generate a different memory access. Again, we need a way to track all of the data that has come back and make sure it's all ready before sending the vector instruction down the pipe a final time to do writeback.
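For option 2, the bookkeeping might look roughly like this (a sketch under the assumption that the instruction stays in the LDST queue and tracks completed iterations; all names are hypothetical):

// Sketch: the vector inst is re-issued down the pipe once per memory access; only the
// final pass, after all data has returned, performs the single writeback.
bool readyForWriteback(const LoadStoreInstInfoPtr & ldst_info_ptr)
{
    return ldst_info_ptr->getCompletedVectorIters() == ldst_info_ptr->getTotalVectorIters();
}

void onPipelineExit(const LoadStoreInstInfoPtr & ldst_info_ptr)
{
    if (readyForWriteback(ldst_info_ptr))
    {
        appendToWritebackStage(ldst_info_ptr); // one WB for the whole vector destination
    }
    else
    {
        reissue(ldst_info_ptr);                // next pass generates the next memory access
    }
}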

@aarongchan aarongchan marked this pull request as ready for review August 1, 2024 14:50
@aarongchan (Collaborator, Author)

@klingaard @kathlenemagnus this should be ready for review/merge.

@klingaard klingaard self-requested a review August 8, 2024 21:23
@klingaard (Collaborator) left a comment:

I'll preemptively approve this as Kathlene will be leading the charge on this PR for now.

Thank you for your contributions to the project, @aarongchan! Best of luck to you at Intel.

core/Inst.hpp Outdated
Comment on lines 510 to 494
uint32_t mop_;
uint32_t stride_;
Collaborator:

Not initialized.
