
Support for lists and tuples #51

Open

Lonami opened this issue Oct 7, 2022 · 26 comments
@Lonami
Owner

Lonami commented Oct 7, 2022

Discussion from #50:

Lonami

[lists] are tricky because they're no longer a "simple value" which is held in a single variable.

While for x in [a, b, c]: could be special-cased to run the loop body with x going through all of a, b, c, things get more difficult if you were to support to assign the list to a variable, which can then be moved around, passed as parameters, iterated over, etc. Because the type would have to be tracked to properly work on that variable (e.g. in x = y, the assignment would be very different for lists and simple values).

Unless… perhaps "list" variables point to their length, and when used as a list, they "just" work. Imagine

x = [1, 2, 3]
for y in x:
  print(y)

Could generate (pseudo)

x = 3
x__list0 = 1
x__list1 = 2
x__list2 = 3
set loopcounter 0
:loop
jmp loopcounter == 3 to :exitloop
set y x__list{loopcounter}
print y
printflush
jmp :loop
:exitloop
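The "length + numbered variables" scheme sketched above can be modeled in Python. This is only a sketch of the compile-time expansion, not pyndustric output; `env` and `iterate` are hypothetical names:

```python
# Model of "list variables point to their length": the compiler flattens
# x = [1, 2, 3] into a length variable plus one numbered variable per
# element (x__list0, x__list1, ...), all living in one flat namespace.
env = {"x": 3, "x__list0": 1, "x__list1": 2, "x__list2": 3}

def iterate(env, name):
    """Walk a 'list' variable the way the generated loop would."""
    length = env[name]
    for i in range(length):
        # .get models mlog reading an unset variable as null
        yield env.get(f"{name}__list{i}")

print(list(iterate(env, "x")))       # [1, 2, 3]
print(list(iterate({"y": 2}, "y")))  # [None, None]
```

The second call models the `x = 3; for y in x` case: the loop runs, but every element lookup reads an unset variable and comes back null.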

Yeah that's some nice food for thought.

(Now as for what would happen if you had x = 3; for y in x, well, it would do the same, but it would pick out null every time, kinda "undefined behaviour".)

Indieberrie

Have you thought about using heap to allocate arrays?
Allocating arrays on the stack (or using infinite registers) is only good when the array size is known at compile time.
As I see it, the whole idea of Python arrays (I use "array" and "list" interchangeably) is that they are runtime-resizable, which means implementing them on the stack is a bad idea

Lonami

The heap (memory cells) can only store numbers (so, no strings, or @variables):

set result "hello"
write result cell1 0
read result cell1 0
print result
printflush message1

This prints 1. Removing the read does print "hello".

It's true, as you point out, infinite variables would not work with dynamic array sizes (since there's no way to refer to a dynamic variable name in mlog).

However, tuples are immutable, so the fixed size makes more sense. Maybe lists could only support numbers (implemented via Mem), and tuples would support anything (implemented via infinite registers). But what memory cell would be used for the list?

foo = [1, 2, 3] @ cell1

could be made to work. Ah, so many ideas!

@Indieberrie
Contributor

Indieberrie commented Oct 7, 2022

Huh, I learned something new. I didn't know that thing with square brackets is a tuple (not a list) and it's immutable. I thought it was list initialization. I don't know Python, I just use it from time to time. I really ought to sit down for a day and learn it properly

Well, "real" computers don't save strings into memory either. They save number representations. There is a reason why the following works in C:

char c1='a',c2='b';
char c3=c1+c2;
printf("%d%c",c3,c3);

Strings are just character arrays, which are actually number arrays. When doing stuff in "lower level" languages than Python, like C/C++, strings are saved in dynamic arrays with a wrapper class that allows certain methods to be called on them.
So this would mean that, under the hood, strings would have to be saved separately in memory (as a separate array), and any class with a string property, or array with string elements, would actually hold a pointer to that string array data structure (either the whole object, or just an address and length). When you think about it, that is how it works in C; that's why a string is just a pointer to char and an array of strings is a pointer to pointer to char
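For reference, the numbers-in-memory view is easy to demonstrate from Python itself; `ord` and `chr` are exactly the conversions mlog lacks:

```python
# A string is just a sequence of code points; ord/chr convert both ways.
s = "Hey"
codes = [ord(c) for c in s]                 # the "number array" form
round_trip = "".join(chr(n) for n in codes)
print(codes, round_trip)                    # [72, 101, 121] Hey
```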

At least that's how I would implement it. I had an idea to make a C compiler (since I already talk way too much "low" level instead of Python) using LLVM, but it's too much work for a game I play once every 2 years

@Lonami
Owner Author

Lonami commented Oct 7, 2022

Huh, I learned something new. I didn't know that thing with square brackets is a tuple (not a list) and it's immutable

Sorry I didn't make it clear. That was an example with a list.

my_list = [1, 2, 3]
my_list.append(4)
my_list[3] = 5
my_list.pop()  # 5

my_tuple = (1, 2, 3)  # (parentheses optional)
# my_tuple[2] = 4  # error! tuples are immutable
# my_tuple.append(4)  # error! tuples do not have .append or .pop

Well, "real" computers don't save strings into memory either. They save number representations. There is a reason why the following works in C:

Yeah, I'm aware. However, unlike C (and most languages), mlog does not really have support for accessing a string's bytes, characters, or Unicode code points. So not even pyndustric could possibly "compile" storing strings to memory. If mlog had an op to "get character" or something, sure, but unfortunately it doesn't. Even the + operator on two strings results in 2! (I assume they are treated as truthy values, or 1, so the result is 2.)

So this would mean that under the hood strings would have to be saved separately in memory (as separate array) and any class that has string property or array having string elements would actually have a pointer to that string array data structure (either whole object or just address and length).

This is simply not possible with the instructions mlog provides in the 6.0 release, and I don't remember anything in the 7.0 which would enable this either.

At least that's how I would implement it. I had an idea to make a C compiler (since I already talk way too much "low" level instead of Python) using LLVM, but it's too much work for a game I play once every 2 years

No need for a C compiler, just a new LLVM backend would be enough, and we'd get all benefits from having an already-existing optimizing compiler. But this is not the issue here (I wouldn't worry that much about hyper-optimizing mlog programs).

@Lonami
Owner Author

Lonami commented Oct 7, 2022

As another example, if pyndustric compiled:

a = "Hey"

to

write 72 cell1 0
write 101 cell1 1
write 121 cell1 2

then

print(a)

couldn't possibly work because there is no way to turn the numbers back to a string in mlog.


Now, if pyndustric compiled:

a = "Hey"

to

set a "Hey"

then

print(a)

would work just fine (as it does now), but there's no way to store the characters in memory cells because mlog doesn't have a way to pick a string's bytes apart.

@Indieberrie
Contributor

Indieberrie commented Oct 7, 2022

How about

a = "Hey"
print(a)

compiles to (I'm going to mix some higher-level language in here)

# a = "Hey"
set a 0 #address of a on the heap (like pointer to 0th element in C)
write 3 cell2 0 #3 is length # cell1 is already stack
write 72 cell2 1 
write 101 cell2 2
write 121 cell2 3

# print(a)
for n in range(1,cell2[a]+1):
    print(cell2[a+n])
    #where print(number) here would go through massive switch case checking against ASCII table

I looked and haven't seen any instruction that takes in a string besides print, so it seems there is no need to concatenate (for now, correct me if I missed something).
The only thing besides print that takes a string is jumps, so doing conditions on this bare-metal string implementation could look like:

a = "aa"
b = "bb"
if a == b:
    print("equals")
# a = "aa"
    set a 0
    write 2 cell2 0
    write 97 cell2 1
    write 97 cell2 2

# b ="bb"
    set b 3
    write 2 cell2 3
    write 98 cell2 4
    write 98 cell2 5

#if a==b: ## I'm not writing assembly for this
    #equal length?
    if cell2[a]==cell2[b]:
        for n in range(1,cell2[a]+1):
            if cell2[a+n] != cell2[b+n]:
                jump _notequals
    else:
        jump _notequals

_equals:
print("equals")
_notequals:
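The sketch above can be turned into a small runnable Python model of the memory cell, purely to sanity-check the layout (length at the pointer, code points after it). `store_string` and `strings_equal` are hypothetical names, not pyndustric output:

```python
cell2 = [0] * 64  # stand-in for a Mindustry memory cell

def store_string(cell, addr, s):
    # layout: [length, code, code, ...] starting at addr
    cell[addr] = len(s)
    for i, ch in enumerate(s, start=1):
        cell[addr + i] = ord(ch)
    return addr  # the "pointer" the variable would hold

def strings_equal(cell, a, b):
    if cell[a] != cell[b]:  # equal length?
        return False
    return all(cell[a + n] == cell[b + n] for n in range(1, cell[a] + 1))

a = store_string(cell2, 0, "aa")
b = store_string(cell2, 3, "bb")
print(strings_equal(cell2, a, b))  # False
```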

@Indieberrie
Contributor

Indieberrie commented Oct 7, 2022

Now, that was a demonstration of simple static arrays. An array of strings would be a simple array holding pointers to arrays of characters.
Doing dynamic ones would require something like malloc and free (which is not impossible, but costly). On top of those, something equivalent to vector could be built (or whatever data structure cpp uses to hold the characters in its string class)

@Lonami
Owner Author

Lonami commented Oct 7, 2022

where print(number) here would go through massive switch case

With 95 printable characters, assuming the switch consists of set _tmp = "a"; jmp _ret for each letter, the switch would need at least 190 instructions, excluding any other call overhead and matching the variables. Which would need to be embedded into every single program which uses strings and prints. This is just not feasible.

Any string operation also becomes ridiculously expensive. Storing strings in memory is just not feasible (I would not even be willing to merge such a feature - it's too much).

Anyway, we are deviating from the main topic of implementing lists and tuples, but essentially strings cannot be dynamically stored anywhere in a reasonable way, so I won't be willing to merge such a thing for strings.

@Indieberrie
Contributor

Indieberrie commented Oct 7, 2022

Ok, we throw out supporting strings that aren't literals in registers. We are left with implementing stack (static) arrays that hold regular numbers (which is really useful and not that hard).
For dynamic ones, I have to think about the price. Again, not impossible, but it requires memory management

@Lonami
Owner Author

Lonami commented Oct 7, 2022

There is a problem implementing tuples (roughly "statically sized arrays") when variables are dynamically typed with no way of knowing a variable's type at runtime or compile time. Essentially, a 1-tuple is different from a 2-tuple, which is different from a 3-tuple, etc. So if you had:

tup = ("hello", "bye")
for elem in tup:
    print(elem)

When compiling for elem in tup:, how would the compiler know to generate an access for tup__0 and tup__1? It can't unless it tracked the type of the variable, which is not always possible. Why can't the compiler track the type?:

if random() > 0.5:
    tup = (1, 2, 3, 4)
else:
    tup = ("hello", "bye")

for elem in tup:
    print(elem)

Here it will either be a 2-tuple or a 4-tuple at runtime, but if tuples were to be implemented using infinite variables (to support strings, which can't go in memory cells), then knowing their length simply becomes impossible.

Perhaps we have to give up on implementing tuples.

@Lonami
Owner Author

Lonami commented Oct 7, 2022

For dynamically sized lists, memory cells can indeed be used. But remember that a memory cell can be as small as 64 numbers in capacity. For this code:

if random() > 0.5:
    arr = [1, 2, 3, 4]
else:
    arr = [101, 102]

for elem in arr:
    print(elem)

When compiling the loop, the only information we have again is just "user is trying to iterate over arr".

Assuming the compiler stored "a pointer" in the arr variable, it could then access the length at cell[pointer], and the rest of elements starting at cell[pointer + 1]. But how do we know which cell? Perhaps when assigning a list the compiler actually creates set arr pointer; set arr__cell cell1, so when generating the loop it assumes it can check arr__cell for the cell's name (and it would silently fail if it turned out arr was not an array).

(All of this applies to number-only tuples too. Perhaps that's what should be implemented, just tuples backed by memory.)
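The pointer-plus-companion-variable scheme can be sketched in Python. `make_list` and `iter_list` are hypothetical helpers modeling what the compiler would emit, not an actual pyndustric API:

```python
# A list variable compiles to two scalars: arr (a pointer into a cell)
# and a companion arr__cell naming the backing cell, as described above.
cells = {"cell1": [0] * 64}

def make_list(env, cells, name, cell_name, values, pointer):
    cell = cells[cell_name]
    cell[pointer] = len(values)                      # length at cell[pointer]
    cell[pointer + 1 : pointer + 1 + len(values)] = values
    env[name] = pointer                              # set arr pointer
    env[name + "__cell"] = cell_name                 # set arr__cell cell1

def iter_list(env, cells, name):
    ptr = env[name]
    cell = cells[env[name + "__cell"]]
    for n in range(cell[ptr]):                       # length-driven loop
        yield cell[ptr + 1 + n]

env = {}
make_list(env, cells, "arr", "cell1", [1, 2, 3, 4], 0)
print(list(iter_list(env, cells, "arr")))  # [1, 2, 3, 4]
```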

Dynamically sized lists would mean the compiler also needs to write some sort of memory allocator, which means reserving some memory for the allocator itself, and would also need to embed routines to create and free allocations (at a minimum; a resize / move routine could also be added for more efficiency). If a list grows large enough that it doesn't even fit in a single memory cell, well, that's a fun problem: you would probably need to handle a dynamically sized amount of memory cells for compiling lists in the general case.


I think allocators and dynamic lists are completely overkill for this project. Logic processors in Mindustry have a limited amount of instructions. Embedding so much code, even if it were only embedded when the features are used, feels like far too much.

I would rather investigate how to improve the ergonomics of using memory cells.

I think supporting things such as:

start = 10
end = 20
for elem in Mem.cell1[start:end]:
    print(elem)

Mem.cell1[5:10] = Mem.cell1[25:30]
Mem.cell1[40:45] = [1, 2, 3, 4, 5]

...would mostly mitigate the need for having actual lists.
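A slice like Mem.cell1[start:end] would presumably desugar to a plain counted read loop; a Python model of that desugaring, with a hypothetical `read` standing in for mlog's instruction:

```python
cell1 = list(range(64))  # stand-in for the memory cell's contents

def read(cell, i):
    # models mlog's `read result cell1 i`
    return cell[i]

start, end = 10, 13
values = [read(cell1, i) for i in range(start, end)]
print(values)  # [10, 11, 12]
```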

Non-growable lists (or tuples) can probably be implemented with the approach I described (the compiler generates a pointer + cell pair per declared list variable), but the fact that it could silently fail with this approach (e.g. iterating over some variable when it's not actually a list) is not very nice.

If only iteration over Mem is allowed, then the compiler can properly fail when iterating over anything else (meaning there is less room for error). I feel like this is the correct approach. You "only" lose the ability to name lists something more concrete, but realistically I don't expect programs to have so many lists that this becomes a problem.

@Indieberrie
Contributor

Sorry it took a while to get back, I was super busy!

I thought about implementing tuples. We agree on a lot of stuff, but here are my caveats:

When compiling for elem in tup:, how would the compiler know to generate an access for tup__0 and tup__1? It can't unless it tracked the type of the variable, which is not always possible. Why can't the compiler track the type?:

Accessing tuple elements (if they are stored in infinite registers, as they probably should be; sorry about my static-stack blip, having stuff beyond numbers in tuples is useful) should be done via a switch case (I don't see any other way because of register naming). So the compiler shouldn't generate access points when it encounters the for loop, but from information about the tuple assignment. I guess it would be reasonable to have some sort of lookahead to see the maximum length of the tuple and generate the switch case according to that (each tuple would have its own switch case). The for loop would obviously iterate over the actual number of elements. Also, the compiler could (if told to, for example by a flag) insert out-of-bounds checks

In regards to what happens if the programmer later assigns a number to a variable that was holding a tuple and then tries to iterate: let them shoot themselves in the foot.
As I understand it, it is not the job of a Python interpreter to prevent people from doing that. Letting them know when that happens/could happen is nice, and there might be a few ways to go about it (print a runtime error to a message block, or have the compiler warn you if you try to use the same variable name for a tuple and anything else)

So I guess a tuple being represented as pointer_to_switch and length would be reasonable

Another thing would be to make elements mutable (there is no reason not to; it would just add more instructions to the compiled program). The only requirement for tuples is to have their size fixed at compile time (hence my static-stack blip)


I checked the largest memory cell and was super disappointed when I figured out it was 512 addresses. I thought I'd seen some K somewhere. Such harsh limitations...
You might be right that dynamic lists are overkill, but the philosophy of "pay only for what you use" is not allowing me to let go.

Perhaps we have to give up on implementing tuples.

I think tuples are a must, and if we decide to give up on something, the heap idea is it!

Here are some ideas:

It makes sense to hold only numbers there.

The programmer knows how much heap they will give to the processor (by how much they built in-game), so passing options to the compiler as [ {cell_number, cell_size} ] would allow the compiler to mash all the cells into one heap (e.g. if you give 4 cells each of size 512, you have a 2k heap; the cell number is so that the compiler knows which cells are accessible).
All arrays (and objects, which don't work either) are on that one heap. Allocating more than there is on the heap should lead to a runtime error. This would go around having a "dynamically sized amount of cells". This way would require virtual/physical address translation to know which cell to access (it's maybe ~20 instructions more)
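That virtual-to-physical translation step could look roughly like this (a sketch; the cell names and sizes are an assumed compiler option, and `translate` is a hypothetical name):

```python
# Cells mashed into one linear heap, listed in option order as (name, size).
cells = [("cell1", 512), ("cell2", 512)]

def translate(addr):
    """Map a linear heap address to (cell name, offset within that cell)."""
    for name, size in cells:
        if addr < size:
            return name, addr
        addr -= size
    raise IndexError("heap address out of range")

print(translate(600))  # ('cell2', 88)
```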

A memory allocator is a must though! Having alloc and free isn't that big of a problem. The easiest algorithm would be to keep a list of free sections of the heap; it gets chopped up any time there is an alloc, and added to any time there is a free.1


Ergonomic improvements for accessing cells are a great idea.

The implementation of the for loop should somehow account for all of the possible things it can iterate over (there is a lot to discuss about this, so maybe later once we settle the previous stuff)

Footnotes

  1. code example:

    # cell2 holds a list of free regions on the heap
    # Treat list_of_free_regions as the only array on cell2;
    # regions are inclusive {begin, end} ranges
    list_of_free_regions = [{"begin": 0, "end": 511}]
    
    def alloc(size):
        for x in list_of_free_regions:
            if size <= x["end"] - x["begin"] + 1:
                temp = x["begin"]
                x["begin"] += size
                return temp
    
    def free(begin, end):
        for x in list_of_free_regions:
            if begin == x["end"] + 1:
                x["end"] = end
                return
            elif end == x["begin"] - 1:
                x["begin"] = begin
                return
        list_of_free_regions.append({"begin": begin, "end": end})
        # there is the case where freed memory is in between 2 other free blocks
        # but this is just proof of concept
        # I also might be forgetting something else
    

@Lonami
Owner Author

Lonami commented Oct 13, 2022

Not sure how best to answer your comment.

Tuples actually compiling to a switch seems reasonable, since we also want the following to compile:

index = 2
tup = ('a', 'b', 'c', 'd')
elem_c = tup[index]

And indeed the only way to do that at runtime within Mindustry's processor is some sort of switch or jump table. This worries me though:

n = 2
backup = None
for i in range(n):
    tup = ('a', 'b', i)
    if backup is None:
        backup = tup

print(backup[2])  # should print 0, but tup[2] is 1

The loop will run its body twice. On the first iteration, a tuple ('a', 'b', 0) is created. The compiler would have to create a jump table somewhere in the code for this. Because backup is None, it will be assigned tup. On the second iteration, at the same point as before, a tuple ('a', 'b', 1) is created. Where does this live? Does it overwrite the old jump table? If it does and backup still points to it, the print at the end will print 1 and not 0! Is another jump table created when doing backup = tup? How does the compiler know tup is a tuple, though? Types are not being tracked (and that is not always possible, see previous comments). Is it just illegal to move tuples?

Regarding mutability of tuples, yeah, we can do that. It's not something normally allowed in Python but it can be done. pyndustric would be repurposing "tuples for fixed size, lists for dynamic", both mutable.

The programmer knows how much heap they will give to the processor (by how much they built in-game), so passing options to the compiler as [ {cell_number, cell_size} ] would allow the compiler to mash all the cells into one heap (e.g. if you give 4 cells each of size 512, you have a 2k heap; the cell number is so that the compiler knows which cells are accessible).

Not sure about this. I think the code should specify which cell to use, not a compiler option. But I can see your idea being "easier" on the programmer and harder on the compiler developer.

@Indieberrie
Contributor

Can we just make tuples immovable and anytime someone tries to assign them, we actually provide reference?

@Lonami
Owner Author

Lonami commented Oct 15, 2022

We cannot know the type of variables, so we cannot force them to be "immovable" (if by that you mean another_tuple = tuple_variable).

@Indieberrie
Contributor

I mean that when someone tries to copy it (another_tuple = tuple_variable), instead of another_tuple holding a copy of tuple_variable, it holds a reference to tuple_variable, so a change through either of the two will change the single tuple that exists in memory (kinda what JavaScript does)

@Indieberrie
Contributor

Indieberrie commented Oct 15, 2022

It's your second idea:

If it does and backup still points to it, the print at the end will print 1 and not 0!

I think the only feasible way is to do it like that (by passing tuples around by reference)

@Lonami
Owner Author

Lonami commented Oct 16, 2022

I mean that when someone tries to copy it (another_tuple = tuple_variable), instead of another_tuple holding a copy of tuple_variable, it holds a reference to tuple_variable, so a change through either of the two will change the single tuple that exists in memory (kinda what JavaScript does)

Assuming tuple_variable contains the pc address of the tuple information, another_tuple = tuple_variable can remain a simple set (as it is with any other value), so in essence it would work the same, yeah. As in:

tup_a = ('a', 'b')
tup_b = tup_a
for x in tup_b:
  print(x)

compiling to (kind of):

; tuple definition begin
set __res 2 ; tuple len
jmp __back
set __res "a" ; first elem
jmp __back
set __res "b" ; second elem
jmp __back
; tuple definition end
set tup_a 0 ; 0 is the addr of the tuple definition
set tup_b tup_a ; addr is copied
; loops over a variable assume the variable is a tuple. first, obtain the tuple's len
call tup_b ; pretend call sets up __back then jmps, like a function call
set __len __res
set __loop 0
:loop
jmp __loop == __len to :exitloop
call tup_b + 2 + __loop * 2 ; access tuple element by index
print __res
op add __loop __loop 1
jmp :loop
:exitloop

Seems like we have a plan, next would be implementing it and adding tests.

In essence, tuples compile to pretty much "functions" which need to be "called" at a certain offset (offset 0 for len, offset 2 * (index + 1) for the element at index) and jumped back from.
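As a sanity check, that call-at-an-offset convention can be modeled in Python, with the compiled program as a table from pc address to the value the instruction there would leave in __res:

```python
# The compiled tuple as a table: pc address -> value the `set __res ...`
# instruction at that address would produce before jumping back.
program = {
    0: 2,      # offset 0: the tuple's length
    2: "a",    # offset 2 * (0 + 1): element 0
    4: "b",    # offset 2 * (1 + 1): element 1
}

def call(addr):
    # models `call addr`: jump there, run the set, jump back with __res
    return program[addr]

tup_a = 0      # address of the tuple definition
tup_b = tup_a  # copying a tuple copies only the address (a reference)
elems = [call(tup_b + 2 * (i + 1)) for i in range(call(tup_b))]
print(elems)  # ['a', 'b']
```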

@Lonami
Owner Author

Lonami commented Oct 16, 2022

Note that:

nested_tuple = ( ("a", "b"), (1, 2), ("c", "d") )

should probably compile fine and result in 4 compiled jump tables (three for the inner ones, then one for the outer one).

@Lonami
Owner Author

Lonami commented Oct 16, 2022

Maybe:

tup = ("a", "b")
a, b = tup

could also work (well, that would be another feature of tuples).

@Indieberrie
Contributor

I'm glad that's sorted out.

What should we do about figuring out the type we are trying to iterate over?

@Lonami
Owner Author

Lonami commented Oct 17, 2022

We do not care about that. Whenever the compiler sees for VAR in ITERABLE (where both VAR and ITERABLE are variable names the user created), we assume ITERABLE points to the pc address of the tuple definition with (len, items...). The tuple's elements can be anything, it doesn't matter.

Any of these expressions (and perhaps more) should assume ITERABLE contains the address of the jump table with (len, items...):

  • for x in ITERABLE
  • a, b = ITERABLE
  • a = ITERABLE[1]
  • a = len(ITERABLE)
  • ITERABLE[1] = a

@Indieberrie
Contributor

Yes, it's simple if we only do tuples, but what about Env.links() and what if we also decide to implement heap arrays?

@Lonami
Owner Author

Lonami commented Oct 17, 2022

Env.links() and others are special-cased "system calls". You can tell them apart because they have the form Namespace.name, as opposed to just a name the user can create. Those work in their own way and do not need to follow the rules of tuples.

The compiler can simply forbid links = Env.links() and only allow for x in Env.links() to work.

@Indieberrie
Contributor

Hmmm, an idea that comes to mind to prevent problems with tuples and dynamic arrays is for the memory management system to bake in, at compile time, the forbidden addresses (pointers) that are used by tuples, and to use other addresses. When we jump to iterate, we first check whether we have a "tuple address" or something else, and if it's something else, we pass it to the memory management system

@Lonami
Owner Author

Lonami commented Oct 17, 2022

This would add an extra check every time such an operation occurs, with an overhead of 2 instructions (compare and jump) at a minimum. We shouldn't get too fancy. The 1000-instruction limit isn't a high number.

@Indieberrie
Contributor

Indieberrie commented Oct 17, 2022

I'll need to think about this then

@Lonami
Owner Author

Lonami commented Oct 17, 2022

I would first implement tuples and then worry about dynamic arrays (dynamic arrays could be similar to tuples: imagine they compile to the same switch table, but instead of jumping back, they jump elsewhere to determine the real address or similar; this would eliminate the check from every access and push the overhead onto dynamic ones only).
