Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excel Dynamic Arrays (Avoid Adding At-Signs to Formulas) #3962

Merged
merged 42 commits into from
Aug 12, 2024

Conversation

oleibman
Copy link
Collaborator

@oleibman oleibman commented Mar 27, 2024

This has come up a number of times, most recently with issue #3901, and also issue #3659, and issue #1834. It will certainly come up more often in days to come. Excel is changing formulas which PhpSpreadsheet has output as =UNIQUE(A1:A19); Excel is processing the formula as it were =@UNIQUE(A1:A19). This behavior is explained, in part, by #3659 (comment). It is doing so in order to ensure that the function returns only a single value rather than an array of values, in case the spreadsheet is being processed (or possibly was created) by a less current version of Excel which cannot handle the array result.

PhpSpreadsheet follows Excel to a certain extent; it defaults to returning a single calculated value when an array would be returned. Further, its support for outputting an array even when that default is overridden is incomplete. I am not prepared to do everything that Excel does for the array functions (details below), but this PR is a start in that direction. If the default is changed via:

use PhpOffice\PhpSpreadsheet\Calculation\Calculation;
Calculation::setArrayReturnType(Calculation::RETURN_ARRAY_AS_ARRAY);

When that is done, getCalculatedValue will return an array (no code change necessary). However, Writer/Xlsx will now be updated to look at that value, and if an array is returned in that circumstance, will indicate in the Xml that the result is an array and will include a reference to the bounds of the array. This gets us close, although not completely there, to what Excel does, and may be good enough for now. Excel will still mess with the formula, but now it will treat it as {=UNIQUE(A1:A19)}. This means that the spreadsheet will now look correct; there will be superficial differences, but all cells will have the expected value.

Technically, the major difference between what PhpSpreadsheet will output now, and what Excel does on its own, is that Excel supplies values in the xml for all the cells in the range. That would be difficult for PhpSpreadsheet to do; that could be a project for another day. Excel will treat the output from PhpSpreadsheet as "Array Formulas" (a.k.a. CSE (control shift enter) formulas because you need to use that combination of keys to manually enter them in older versions of Excel). Current versions of Excel will instead use "Dynamic Array Formulas". Dynamic Array Formulas can be changed by the user; Array Formulas need to be deleted and re-entered if you want to change them. I don't know what else might have to change to get Excel to use the latter for PhpSpreadsheet formulas, and I will probably not even try to look now, saving it for a future date.

Unit testing of this change uncovered a bug in Calculation::calculateCellValue. That routine saves off ArrayReturnType, and may change it, and is supposed to restore it. But it does not do the restore if the calculation throws an exception. It is changed to do so.

This is:

  • a bugfix
  • a new feature
  • refactoring
  • additional unit tests

Checklist:

  • Changes are covered by unit tests
    • Changes are covered by existing unit tests
    • New unit tests have been added
  • Code style is respected
  • Commit message explains why the change is made (see https://github.com/erlang/otp/wiki/Writing-good-commit-messages)
  • CHANGELOG.md contains a short summary of the change and a link to the pull request if applicable
  • Documentation is updated as necessary

Why this change is needed?

Provide an explanation of why this change is needed, with links to any Issues (if appropriate).
If this is a bugfix or a new feature, and there are no existing Issues, then please also create an issue that will make it easier to track progress with this PR.

This has come up a number of times, most recently with issue PHPOffice#3901, and also issue PHPOffice#3659. It will certainly come up more often in days to come. Excel is changing formulas which PhpSpreadsheet has output as `=UNIQUE(A1:A19)`; Excel is processing the formula as it were `=@unique(A1:A19)`. This behavior is explained, in part, by PHPOffice#3659 (comment). It is doing so in order to ensure that the function returns only a single value rather than an array of values, in case the spreadsheet is being processed (or possibly was created) by a less current version of Excel which cannot handle the array result.

PhpSpreadsheet follows Excel to a certain extent; it defaults to returning a single calculated value when an array would be returned. Further, its support for outputting an array even when that default is overridden is incomplete. I am not prepared to do everything that Excel does for the array functions (details below), but this PR is a start in that direction. If the default is changed via:
```php
use PhpOffice\PhpSpreadsheet\Calculation\Calculation;
Calculation::setArrayReturnType(Calculation::RETURN_ARRAY_AS_ARRAY);
```
When that is done, `getCalculatedValue` will return an array (no code change necessary). However, Writer/Xlsx will now be updated to look at that value, and if an array is returned in that circumstance, will indicate in the Xml that the result is an array *and* will include a reference to the bounds of the array. This gets us close, although not completely there, to what Excel does, and may be good enough for now. Excel will still mess with the formula, but now it will treat it as `{=UNIQUE(A1:A19)}`. This means that the spreadsheet will now look correct; there will be superficial differences, but all cells will have the expected value.

Technically, the major difference between what PhpSpreadsheet will output now, and what Excel does on its own, is that Excel supplies values in the xml for all the cells in the range. That would be difficult for PhpSpreadsheet to do; that could be a project for another day. Excel will treat the output from PhpSpreadsheet as "Array Formulas" (a.k.a. CSE (control shift enter) formulas because you need to use that combination of keys to manually enter them in older versions of Excel). Current versions of Excel will instead use "Dynamic Array Formulas". Dynamic Array Formulas can be changed by the user; Array Formulas need to be deleted and re-entered if you want to change them. I don't know what else might have to change to get Excel to use the latter for PhpSpreadsheet formulas, and I will probably not even try to look now, saving it for a future date.

Unit testing of this change uncovered a bug in Calculation::calculateCellValue. That routine saves off ArrayReturnType, and may change it, and is supposed to restore it. But it does not do the restore if the calculation throws an exception. It is changed to do so.
@oleibman oleibman marked this pull request as draft March 27, 2024 00:24
@MarkBaker
Copy link
Member

There was a lot of code that I'd implemented in #2787 that was designed to provide this full support for array formulae, including the handling of ranged returns and support for dynamic ranging with all the relevant tests and some of the new dynamic array functions.
Unfortunately it got too wrapped with other issues and resolutions making it difficult to extract just those parts of the code, and did require a change to the Read Filters that made it a BC break.

Thinking about PHPOffice#3958 - user wondered if unsupported formulas with array results could be handled better. I said that the answer was "no", but I think Xlsx Reader can make use of the dimensions of the result after all, so the answer is actually "sometimes". This is an initial attempt to do that. Implementing it revealed a bug in how Xlsx Reader handles array formula attributes, and that is now corrected. Likewise, Xlsx Writer did not indicate a value for the first cell in the array, and does now.
@oleibman
Copy link
Collaborator Author

@MarkBaker Thanks, I will look over your code. We took very different approaches. I'll see if I can can reconcile some of the differences. Did you ever figure what was needed to make Excel open the PhpSpreadsheet-generated file so that the array functions are treated as dynamic rather than CSE?

@jr212
Copy link

jr212 commented Apr 18, 2024

The solution above doesn't work for me.
My test files(php and a json for the data) can be found on my website https://janr.be/excel.zip. I leave it there for ±1month

Address sample code submitted by @jr212 which was not working correctly.
@oleibman
Copy link
Collaborator Author

@jr212 Thank you for the sample code. Please try again against this PR, which I have changed based on your example. That demonstrated at least 2 problems with the existing code. One, relatively easily handled, was that, since your code used auto-sized columns, the PhpSpreadsheet code wound up formatting the array results, and the formatting code was not expecting array inputs. I will note the array calculation supplies a value only to the first cell of the array, so the auto-sizing is ineffective anyhow. I will think about this some more; perhaps it is something we can live with, or perhaps the code can do better.

The second problem is a little more troublesome. A formula like =sheet!cell winds up returning a 1-element array (row->column->value). With "normal" array handling, this winds up reduced to value. But, with RETURN_ARRAY_AS_ARRAY, it remains as an array. I have added code to recognize this situation and return just the value part, but I'm not entirely convinced that this change won't fail due to false positives or negatives.

Finally, even though the Excel spreadsheet generated by your code is now correct (I think - please verify), I don't think that getCalculatedValue for cell A2 on sheet 'Gesorteerd verlorenpunten' agrees with what Excel shows.

All in all, your example is very useful. But it may cause me to leave this change in draft status for quite a bit longer than I had planned.

@jr212
Copy link

jr212 commented Apr 19, 2024 via email

@oleibman
Copy link
Collaborator Author

@jr212 It is lucky then that I didn't realize your initial problem was with Composer, otherwise I might not have noticed the code problems. I am also not expert in Composer, but I think the following should work. Assume that you have a main directory called "git" (change to whatever is appropriate). Go to that directory and issue the following command:

git clone -b atsign --single-branch https://github.com/oleibman/PhpSpreadsheet.git atsign3

This will create a new directory "git/atsign3". Change to that directory and run:

composer install --prefer-source

It would not surprise me in the least if there is a simpler way to do this, but this should work.

@oleibman
Copy link
Collaborator Author

Merged master to eliminate a no-longer-valid dependency in composer.lock, not as prep work before installing.

@jr212
Copy link

jr212 commented Apr 21, 2024

I've finally succeeded to build everything 😊
Never worked with git before but no problem there.
Test is now complete, with success and in production. Remains only the width of columns. (I can live with that) You wrote before you look at the header of the column (first cell) but people’s names are bigger than only “Name” or in nl-be “Naam”

Questions now
• What files/folder are needed for all functions? I’ve placed now the vendor and src folder but that is 6.98 GiB.
The file PhpSpreadsheet_2.0\vendor\phpstan\phpstan.git\objects\pack\pack-35b3e82d34e6138656d21f8ef47f303a6a20f372.pack alone is 4.8 Gib.
• Do you still need my test code
• Is this normal behaviour? first_open
Above when opening the file all is 0 except the first cell. After click "allow edit" (or something like that) all is fine(recalculated).

My complements for all the hard work in this project and the help to me.

Jan

@oleibman
Copy link
Collaborator Author

@jr212 I believe you can get a much smaller directory by using the following install command rather than the one I initially suggested:

composer install --no-dev

Please bear in mind that the code you will be using is some way from being "production ready". I will be glad to continue to work with you, but this code may never make it into production, and, even if it does, I don't know how soon that might happen.

On Windows systems, when a spreadsheet is downloaded from a browser, the behavior you described is normal.

I don't need you to maintain the zip file you uploaded any longer. However, I will eventually need to add a test case based on what it showed me. Do you have a problem with my basing that test case on the data and code you supplied?

@jr212
Copy link

jr212 commented Apr 22, 2024 via email

@jr212
Copy link

jr212 commented May 2, 2024 via email

@oleibman
Copy link
Collaborator Author

@jr212 Your spreadsheet is no longer available at the address you supplied, probably because I took too long to download it. If you have something you want me to look at, please make it available again.

@jr212
Copy link

jr212 commented May 13, 2024 via email

See issue PHPOffice#4062. When calculating an array formula, populate all the cells associated with the result. This is almost the same as Excel's behavior. As yet, there is no attempt to create a #SPILL error, so cells may be inappropriately overwritten. Also, if the array size shrinks (e.g. there are fewer unique values than before),  no attempt is made to unpopulate the cells which were in range but are now outside the new dimensions. Spill and unpopulation are somewhat related, and will probably be handled at the same time, but their time has not yet come.

UNIQUE, at least for rows, was treating all cell (calculated) values as strings. This is not the same behavior as Excel, which will preserve datatypes, and treat int 3 and string 3 as unique values. Excel will, however, treat int 3 and float 3.0 as non-unique. Within UNIQUE, private function uniqueByRow is changed to try to preserve the the datatype when executing (it will probably treat 3.0 as int - I don't know how I can, or even if I should attempt to, do better - but no int nor float should be treated as a string).
Frustrating morning.
I think I should go back to bed.
ArrayFunctions2Test - the calculations seem too complicated for PhpSpreadsheet. The debug log is 21,300 lines, so I don't know how far I will get with it.
@oleibman
Copy link
Collaborator Author

oleibman commented Jun 9, 2024

I think I have finally cracked how to prevent Excel from changing the formula to CSE format. It's hideous.

  • Spreadsheet includes a metadata.xml file which includes a dynamicArrayProperties tag.
  • File workbook.xml.rels has a reference to metadata.xml.
  • File [ContentTypes].xml has a reference to metadata.xml.
  • A cell with a formula which returns an array result has attribute cm="1".

I think it's probably doable, but it might be fragile, especially if other features need to be enabled in this way.

With a number of changes, PhpSpreadsheet can finally generate a spreadsheet which Excel will recognize as a Dynamic Array function rather than CSE. In particular, changes are needed to ContentTypes, workbook.xml.rels, cell definitions in the worksheet, and a new metadata.xml is added.
@oleibman
Copy link
Collaborator Author

I think that if I can figure out SPILL, it will be time to move this out of draft status.

@oleibman
Copy link
Collaborator Author

oleibman commented Jul 8, 2024

@infojunkie This is not just a problem for the concatenation operator - it is a problem for other operators (e.g. +) and functions (e.g. SUM). I think you can get the result you want for the first cell by wrapping the operand in the SINGLE function. I have to think about whether that's good enough (and am very interested to hear your opinion on that subject). Alternatively, I can see changing all single-cell intermediate evaluations to return just a single value ... but then I have to figure out how to avoid doing that for ANCHORARRAY. I may have to revert this to a draft if I don't come up with a resolution soon. Google's extensions (like ARRAYFORMULA) aren't precisely in scope, but I will have to think about them as well.

@oleibman oleibman marked this pull request as draft July 9, 2024 19:26
@oleibman
Copy link
Collaborator Author

oleibman commented Jul 9, 2024

Very disappointing - I think the observation from @infojunkie is a show-stopper, and I haven't yet figured out how to overcome it. Converting back to draft while I do so.

@oleibman oleibman changed the title Excel Dynamic Arrays (Avoid Adding At-Signs to Formulas) WIP Excel Dynamic Arrays (Avoid Adding At-Signs to Formulas) Jul 9, 2024
Some examples submitted by @infojunkie have demonstrated that an anchor cell without a spill operator was not being handled consistently with Excel. This means that the calculated value of such a cell needs to be a scalar when using it as part of a formula without the spill operator, but as an array when using it with the spill operator or when just getting the cell's value. This is tricky. My solution seems awfully kludgey to me, but it does seem to work.

I have another change that I will want to make in a day or so. When that change is pushed, I will take this back out of draft status, and will now aim for an install date of about August 6.

This particular commit removes the changing of array return type during calculations. It's been this way for a very long time, but I don't understand why it should have been needed in the first place. It causes problems (you need to be sure to restore the original value even when, for example, you throw an exception during calculation). I am relieved that it caused a problem only for one test member. The TEXTSPLIT function really makes sense only when you are returning arrays as arrays. Its test now needs to set that value explicitly since the Calculation engine is no longer changing it under the covers. No other tests broke as a result of this change.

One other test "broke" as a result of this commit. One of the tests for INDIRECT had been expecting a null value, and the test was commented with "Excel result is 0". The PhpSpreadsheet result is now 0 as well, so this would seem to be a bugfix rather than a breaking change.
@oleibman
Copy link
Collaborator Author

@infojunkie Please try again with the latest commit.

@infojunkie
Copy link
Contributor

infojunkie commented Jul 10, 2024

Thanks @oleibman running my same test above on the latest commit yields the expected result:

Testing cache value for cell Sheet1!A12
Evaluating formula for cell Sheet1!A12
Formula for cell Sheet1!A12 is A1
Sheet1!A12 => Evaluating Cell A1 in worksheet Sheet1
Sheet1!A12 => Testing cache value for cell Sheet1!A1
Sheet1!A12 => Evaluating formula for cell Sheet1!A1
Sheet1!A12 => Formula for cell Sheet1!A1 is RANDARRAY(10,1, 0, 10)
Sheet1!A12 -> Sheet1!A1 => Evaluating Function RANDARRAY() with 4 arguments
Sheet1!A12 -> Sheet1!A1 => Evaluating RANDARRAY ( 10, 1, 0, 10 )
Sheet1!A12 -> Sheet1!A1 => Evaluation Result for RANDARRAY() function call is a matrix with a value of { 8.3398524710628; 2.2249379997258; 9.3872484887891; 3.9598729340173; 5.264256799251; 0.50387523626158; 0.3429376754644; 7.3651627112949; 7.6823210984852; 0.31232020832241 }
Sheet1!A12 => Evaluation Result for cell 'Sheet1'!A1 in worksheet Sheet1 is a floating point number with a value of 8.3398524710628
Sheet1!A12 => Scalar Result for cell 'Sheet1'!A1 is a floating point number with a value of 8.3398524710628
Testing cache value for cell Sheet1!A13
Evaluating formula for cell Sheet1!A13
Formula for cell Sheet1!A13 is "Hello: "&A1
Sheet1!A13 => Evaluating Cell A1 in worksheet Sheet1
Sheet1!A13 => Testing cache value for cell Sheet1!A1
Sheet1!A13 => Retrieving value for cell Sheet1!A1 from cache
Sheet1!A13 => Evaluation Result for cell 'Sheet1'!A1 in worksheet Sheet1 is a floating point number with a value of 8.3398524710628
Sheet1!A13 => Scalar Result for cell 'Sheet1'!A1 is a floating point number with a value of 8.3398524710628
Sheet1!A13 => Evaluating "Hello: " & 8.3398524710628
Sheet1!A13 => Evaluation Result is a string with a value of "Hello: 8.3398524710628"
Testing cache value for cell Sheet1!A14
Evaluating formula for cell Sheet1!A14
Formula for cell Sheet1!A14 is A3
Sheet1!A14 => Evaluating Cell A3 in worksheet Sheet1
Sheet1!A14 => Evaluation Result for cell 'Sheet1'!A3 in worksheet Sheet1 is a floating point number with a value of 9.3872484887891
Sheet1!A14 => Scalar Result for cell 'Sheet1'!A3 is a floating point number with a value of 9.3872484887891
Testing cache value for cell Sheet1!A15
Evaluating formula for cell Sheet1!A15
Formula for cell Sheet1!A15 is "Hello: "&A3
Sheet1!A15 => Evaluating Cell A3 in worksheet Sheet1
Sheet1!A15 => Evaluation Result for cell 'Sheet1'!A3 in worksheet Sheet1 is a floating point number with a value of 9.3872484887891
Sheet1!A15 => Scalar Result for cell 'Sheet1'!A3 is a floating point number with a value of 9.3872484887891
Sheet1!A15 => Evaluating "Hello: " & 9.3872484887891
Sheet1!A15 => Evaluation Result is a string with a value of "Hello: 9.3872484887891"
/home/kratib/src/quartech/workbc-cc-refactor/src/scripts/spreadsheets/test-dynamic-array-formulas.php:31:
array(4) {
  'A12' =>
  double(8.3398524710628)
  'A13' =>
  string(22) "Hello: 8.3398524710628"
  'A14' =>
  double(9.3872484887891)
  'A15' =>
  string(22) "Hello: 9.3872484887891"
}

This latest fix of yours also fixed a follow-up crash I was encountering in my test script when I attempted to reference cells from the B column, which itself is a SORT on the results of the RANDARRAY - a double spill :-D

Great job!!

Till now we have used a static variable/getter/setter to decide what type of result should be returned when a formula is evaluated and an array is the result. This is messy; it would be much better to use an instance variable instead. We cannot eliminate `setArrayReturnType` and `getArrayReturnType` because that would be a BC break. I am considering whether they should be deprecated. In the meantime, I have added a new instance property `instanceArrayReturnType` with getter and setter methods. The property is initially null, and, if it remains so when needed, the static property will be used instead. However, if it is set, its value will be used.
@oleibman oleibman marked this pull request as ready for review July 10, 2024 18:37
@oleibman
Copy link
Collaborator Author

Ready again. New tentative install date August 7.

@oleibman oleibman changed the title WIP Excel Dynamic Arrays (Avoid Adding At-Signs to Formulas) Excel Dynamic Arrays (Avoid Adding At-Signs to Formulas) Jul 10, 2024
@oleibman
Copy link
Collaborator Author

@PowerKiKi @MarkBaker I would like to create (= "tag"?) release 2.2 next week to clear the deck for this change a few weeks later as the start of release 3.0. I haven't created a release before, so, if you want to do it, that would be fine; but, if you're okay with me trying it, I would like to make sure it happens at a time which is convenient for you to salvage things if I mess up. Can you suggest a time (with timezone) which would work for you next Wednesday or Thursday?

@PowerKiKi
Copy link
Member

PowerKiKi commented Jul 19, 2024

I am not following the project as closely as I used to, but I also love coming back to cut releases 😍

But it's totally fine for you to do it. The process is meant to be simple. AFAIK there are only two things you can screw up:

  • be sure that CI is green, before tagging (duh!)
  • if you use git cli to create the tag, it is likely that git is configured to interpret # as a marker for comments. But you will copy paste markdown which will absolutely contains # for all the titles. My recommendation is to configure git to use a different character for comment markers once and for all. One possibility to do that is to run git config --global core.commentChar ';'. I have been using ; for comments for years now, and it is both very natural and conflicts with nothing I use. (so in that regard, I consider the point 4. of the process, which was added by @MarkBaker, a bit misleading)

I enabled GitHub notifications for new releases. So I suppose you can cut a release whenever you want (unless if Mark is working on something he'd like to include ?), and I'll be double-checking...

PS: don't use prefix like v2.2.0, and don't skip trailing zeroes like 2.2. Do 2.2.0

@oleibman
Copy link
Collaborator Author

@PowerKiKi I released 2.2.0. It's in packagist. However, CI tells me that the "release" step failed - see https://github.com/PHPOffice/PhpSpreadsheet/actions/runs/10077609508/job/27860519974. I don't know if that means something that needs to be done hasn't happened yet. Please take a look.

@infojunkie
Copy link
Contributor

I'm not seeing this PR in the release notes of 2.2.0, nor in the code of master. So if your intention was to merge and release this PR, it hasn't happened.

@oleibman
Copy link
Collaborator Author

@infojunkie 2.2.0 was "clearing the decks" for this change. My plan is for it to be part of a new major release (3.0.0). I still plan to merge this to master around August 7 (leaves time to make sure there are no major problems with 2.2.0), although I do not yet have a date in mind for 3.0.0.

@PowerKiKi
Copy link
Member

@PowerKiKi I released 2.2.0. It's in packagist. However, CI tells me that the "release" step failed - see https://github.com/PHPOffice/PhpSpreadsheet/actions/runs/10077609508/job/27860519974. I don't know if that means something that needs to be done hasn't happened yet. Please take a look.

I am also not sure what happened. But it seems it didn't matter all that much anyway. I noticed this did not happen again with my just released 2.2.1: https://github.com/PHPOffice/PhpSpreadsheet/actions/runs/10140340976/job/28035354511 🤷

This will, I hope, be my last change prior to merge on August 7. PR is fully synced with master (except for this change), and, except for an emergency, I do not intend to merge anything else before this.
@oleibman
Copy link
Collaborator Author

oleibman commented Aug 7, 2024

I had planned to merge this today, but first I need to evaluate issue #4128 (possible break in 2.2.0/2.2.1). New tentative date August 14.

@oleibman
Copy link
Collaborator Author

2.2.2 is confirmed to fix problem with Excel 2016, and no new 2.2.* problems have surfaced. Advancing merge date to Monday Aug. 12.

@oleibman oleibman added this pull request to the merge queue Aug 12, 2024
Merged via the queue into PHPOffice:master with commit 6a8eda9 Aug 12, 2024
12 of 13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants