Add LaTeX export for ACM-BCB #943

Merged
merged 46 commits into from
May 1, 2021

agitter
Collaborator

@agitter agitter commented Apr 27, 2021

This adds preliminary LaTeX export for the ACM-BCB submission. The big picture is that we can automatically generate a base document for the submission, but it will require a fair amount of manual editing before it will build with the ACM sigconf proceedings template. Because of our tight timeline, we'll need to decide how much more to automate versus prioritizing manuscript content.

There is now a list individual-docx-manuscripts.txt of manuscripts to export as docx and another list individual-latex-manuscripts.txt to export for LaTeX. Exporting for LaTeX generates a .tex file and a .bib file that contains reference metadata from Manubot. Hopefully those two outputs work together. I haven't tried building a PDF yet.
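As a rough illustration of how the two list files could drive the export, here is a minimal Python sketch. It assumes each list holds one manuscript keyword per line and that blank lines and `#` comments can be skipped; the real file format and the function names `read_keywords` and `export_plan` are assumptions, not taken from build.sh.

```python
from pathlib import Path
from typing import Dict, List


def read_keywords(list_text: str) -> List[str]:
    """Return manuscript keywords from the text of a list file.

    Assumes one keyword per line; blank lines and '#' comments are skipped.
    """
    keywords = []
    for line in list_text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            keywords.append(line)
    return keywords


def export_plan(docx_text: str, latex_text: str) -> Dict[str, List[str]]:
    """Map each keyword to the output formats requested for it."""
    plan: Dict[str, List[str]] = {}
    for keyword in read_keywords(docx_text):
        plan.setdefault(keyword, []).append("docx")
    for keyword in read_keywords(latex_text):
        # LaTeX export yields both a .tex file and a .bib file.
        plan.setdefault(keyword, []).extend(["tex", "bib"])
    return plan


if __name__ == "__main__":
    # Hypothetical locations of the list files; adjust to the repository layout.
    docx = Path("individual-docx-manuscripts.txt")
    latex = Path("individual-latex-manuscripts.txt")
    if docx.exists() and latex.exists():
        print(export_plan(docx.read_text(), latex.read_text()))
```

A keyword that appears in both lists would be exported in all three formats.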

This builds on two experimental rootstock pull requests: manubot/rootstock#384 and manubot/rootstock#386. One of those upgrades the version of Pandoc. That will cause some changes to our HTML and PDF outputs, so we'll need to check those carefully before merging. I also had to further update the environment to resolve new package incompatibilities. However, the newer version of Pandoc is needed to extract the .bib file.

When outputting .tex files, Pandoc uses a LaTeX template. This is different from what ACM refers to as a template, which is an example .tex file. The Pandoc version of a template can access metadata variables, like our Markdown template for the front matter. This strategy isn't compatible with how Manubot writes author information and some other metadata. I can resolve some of these incompatibilities, but we'll need to decide what missing metadata (authors? affiliations?) is high priority.

I didn't test images. Those will probably not work immediately.

@misc URL citations currently don't include the URL in the bib file.

Here are the current outputs (.txt added so GitHub will allow the attachment):

dhimmel and others added 20 commits November 9, 2020 21:26
panflute upgrade required (not yet on conda-forge, so switch to PyPI)
includes fix to manubot/rootstock#386 (comment):
Element "MetaList" received "CSL_Item" but expected <class 'panflute.base.MetaValue'>
From pandoc 2.11 (2020-10-11) changelog:

> Add CSS to default HTML template (#6601, Mauro Bieg). This greatly
improves the default typography in pandoc’s HTML output. The CSS is
sensitive to a number of variables (e.g. mainfont, fontsize, linestretch):
see the manual for details. To restore the earlier, more spartan output,
you can disable this with -M document-css=false.
Initialize methods authors for testing
Disable individual docx outputs
@AppVeyorBot

AppVeyor build 1.0.4002

@AppVeyorBot

AppVeyor build 1.0.4004 for commit d97e5f6 is now complete.

Found 15 potential spelling error(s). Preview:
content/09.evolution.md:83:nonsynonymous
content/09.evolution.md:139:LVNA
content/23.vaccines-app.md:15:IgGs
content/23.vaccines-app.md:387:IgGs
content/60.methods.md:8:CCS
content/60.methods.md:64:ECRs
content/60.methods.md:80:Manubot's
content/60.methods.md:110:scite
content/60.methods.md:142:scite
content/60.methods.md:159:ECR
content/60.methods.md:160:ECRs
content/60.methods.md:164:docx
content/60.methods.md:166:scite...
The rendered manuscript from this build is temporarily available for download at:

@rando2
Collaborator

rando2 commented Apr 28, 2021

Thank you so much for working on this @agitter! I think as long as we are getting the markdown document into something that is vaguely compatible with the template, it will be a lot easier to clean up the template. I'm not sure how well I can review this PR because it is really complex. @mprobson has also said he can take a look, but I imagine he'll have the same issue with not quite understanding the under-the-hood of Manubot well enough. I can definitely help with reviewing the outputs, though.

@rando2 rando2 added the Methods Strategies for review label Apr 28, 2021
@agitter
Collaborator Author

agitter commented Apr 28, 2021

I'm happy to walk through some of the build script changes if you comment on anything you're curious about. The build.sh script is a little messy because I copied a large code block while trying to debug the tex export.

The main issue is that the newer version of Pandoc does make the HTML and PDF outputs look different. A short-term workaround would be to create two conda environments and manually toggle between them when we're preparing LaTeX export versus making general builds the rest of the time. It would be better to fix the formatting issues, but we have so little time.

For review, you can ignore all the build/assets/acmart-master files. Those are just the files I downloaded from ACM. The main focus of the review (where @mprobson may be able to help) is the two attachments in my original post. We should identify any major problems in the exported .tex file that I can try to fix (e.g. authors) so we can move on to the content.

@mprobson
Collaborator

@agitter this is an amazing effort! I looked through the committed files and I think I understand most of what they're doing. I'll try to chime in with what I can parse but apologies if I missed something.


As far as I can tell the generated .bib file looks correct.


We should be able to generate a rough pdf in the file stage of the build.sh file using one of two methods, depending on what's available in the CI environment:

latexmk -pdf output/$INDIVIDUAL_KEYWORD-manuscript.tex && latexmk -pdf -c output/$INDIVIDUAL_KEYWORD-manuscript.tex

or

pdflatex output/$INDIVIDUAL_KEYWORD-manuscript.tex &> /dev/null
bibtex $INDIVIDUAL_KEYWORD-manuscript &> /dev/null  # not sure about this line if we're using natbib; bibtex takes the .aux basename, which pdflatex writes to the working directory
pdflatex output/$INDIVIDUAL_KEYWORD-manuscript.tex &> /dev/null
pdflatex output/$INDIVIDUAL_KEYWORD-manuscript.tex &> /dev/null

I use the first command locally to build my finished pdfs (and clean the outputs; everything after the && is just housekeeping), and the second set of commands is from this blog post, which links to bibtex's instructions. We may even be able to use pandoc, e.g. pandoc methods.md -o methods.pdf, and potentially skip the .tex and .bib file generation step.
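For reference, the two build options above could be wrapped in a small helper. This is only a sketch, assuming latexmk or pdflatex and bibtex are on the PATH and that the build runs from the directory containing the .tex file; the function names are hypothetical and nothing like this exists in build.sh yet.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def latex_build_commands(tex_path: str, have_latexmk: bool) -> List[List[str]]:
    """Return the commands that build a PDF from tex_path.

    Uses latexmk when available, else pdflatex / bibtex / pdflatex / pdflatex.
    Commands are meant to run from the directory that contains the .tex file.
    """
    tex = Path(tex_path)
    if have_latexmk:
        return [
            ["latexmk", "-pdf", tex.name],
            ["latexmk", "-c", tex.name],  # remove auxiliary files, keep the PDF
        ]
    return [
        ["pdflatex", "-interaction=nonstopmode", tex.name],
        ["bibtex", tex.stem],  # bibtex reads <stem>.aux
        ["pdflatex", "-interaction=nonstopmode", tex.name],
        ["pdflatex", "-interaction=nonstopmode", tex.name],
    ]


def build_pdf(tex_path: str) -> None:
    """Run the build in the .tex file's directory, stopping on the first failure."""
    have_latexmk = shutil.which("latexmk") is not None
    for command in latex_build_commands(tex_path, have_latexmk):
        subprocess.run(command, cwd=Path(tex_path).parent, check=True)
```

Calling `build_pdf("output/methods-manuscript.tex")` would then pick whichever toolchain is installed. Running inside the output directory keeps the .aux file next to the .tex file, so bibtex can find it by basename.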


Also, I don't know that all the build/assets/acmart-master/* files are necessary. The acmart style should be available by default depending on the environment and LaTeX installation. If not, we can probably install it via whatever LaTeX distribution we have. I think we'd only want to include the .cls file (and not everything else) if none of that works.


The majority of my observations are about the generated .tex file.

  1. We should have a method to manually specify the style as documentclass: acmart and classoption: sigconf (see below).

  2. In the .tex file I expected the bibliography to look something like:

\bibliographystyle{ACM-Reference-Format}
\bibliography{methods.bib}

The linked Pandoc-to-LaTeX template does have functionality to generate this but I don't see it being employed even though I see the natbib option in latex.yaml.

  3. In general, it looks like Pandoc expects a yaml stub at the top of the file to fill in all the variables. It appears we can generate and provide this separately via the metadata-files option, e.g. the list of authors and per paper specific LaTeX/YAML configurations, including the document class and bibliography style options mentioned above.

  4. I think we should consider maintaining (or forking) our own template file. You linked several good examples and I'll add one more here: acm-pandoc-conf.tex from this (very helpful) blog post. I can try and create a minimal working example if that would be helpful although I am still trying to figure out how to test the Manubot pipeline.
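To make the separate metadata file idea concrete, here is a minimal sketch that writes one for Pandoc's `--metadata-file` option. `documentclass` and `classoption` are variables in Pandoc's default LaTeX template. The author shape (`name`, `affiliation`) and both function names are assumptions for illustration; a forked template would need to read whatever author fields we settle on. The file is written as JSON, which Pandoc accepts for metadata files alongside YAML, so no YAML library is needed.

```python
import json
from typing import Dict, List


def latex_metadata(authors: List[Dict[str, str]]) -> dict:
    """Build Pandoc metadata for an acmart sigconf document.

    `authors` uses a hypothetical simplified shape:
    [{"name": ..., "affiliation": ...}].
    """
    return {
        "documentclass": "acmart",
        "classoption": ["sigconf"],
        "author": [
            {"name": a["name"], "affiliation": a.get("affiliation", "")}
            for a in authors
        ],
    }


def write_metadata_file(metadata: dict, path: str) -> None:
    """Write metadata as JSON, which Pandoc's --metadata-file also accepts."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2, ensure_ascii=False)
```

The resulting file could be passed as `pandoc --metadata-file=latex-metadata.json ...`, where the file name is again just a placeholder.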

@agitter
Collaborator Author

agitter commented Apr 28, 2021

Thanks for reviewing. You have a good understanding of what's going on.

> We should be able to generate a rough pdf in the file stage of the build.sh file

This should be possible, but I anticipate we'll run into a lot of minor details when trying to fully automate pdf generation that will slow us down. manubot/rootstock#249 and manubot/rootstock#256 have additional context about building the pdf in the continuous integration environment with pandoc (the old Travis CI one, which was less flexible).

The workflow I had in mind was

  • automatically generate .tex and .bib source files
  • (optional) run a simple script to fix some .tex file issues via string replacement
  • download and manually polish the .tex to fix issues noted above and others
  • locally build a pdf

What do you think about that?
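The optional string-replacement step in that workflow might look like the sketch below. It is illustrative only: the two fixes (an unnumbered section heading and a `\bibliographystyle` line) are examples drawn from this thread, and the function names are hypothetical rather than part of any existing script.

```python
import re


def unnumber_section(tex: str, title: str) -> str:
    r"""Turn \section{title} into \section*{title}; leave other text unchanged."""
    pattern = r"\\section\{" + re.escape(title) + r"\}"
    # A callable replacement avoids backslash-escape handling in the template.
    return re.sub(pattern, lambda _match: r"\section*{" + title + "}", tex)


def set_bibliography_style(tex: str, style: str = "ACM-Reference-Format") -> str:
    r"""Insert \bibliographystyle{style} before the first \bibliography{...} if missing."""
    if r"\bibliographystyle{" in tex:
        return tex
    return tex.replace(
        r"\bibliography{", "\\bibliographystyle{" + style + "}\n\\bibliography{", 1
    )
```

Each function takes and returns the full .tex text, so fixes can be chained and rerun safely after every automated export.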

> I don't know that all the build/assets/acmart-master/* files are necessary

I can delete these and keep only the .cls file. My goal was to provide what is needed for someone building locally.

> In general, it looks like Pandoc expects a yaml stub at the top of the file to fill in all the variables.

Yes, Manubot has a different convention for setting authors, but we can modify the metadata in the yaml stub or pass metadata to pandoc as a command line argument. I'm passing the title via the command line for individual manuscripts to override the title in the yaml block. Author metadata is probably worth automating before we merge.

> I am still trying to figure out how to test the Manubot pipeline.

To run this locally

  • Create and activate the conda environment (instructions) using the updated environment.yml from this branch
  • From the root directory of the repository run BUILD_HTML=false BUILD_PDF=false BUILD_INDIVIDUAL=true build/build.sh (save time by not building the full manuscript, only the methods manuscript)

If I'm forgetting a step, I can help debug. I agree that we would need to fork a template if we want to perfect the automated build. If we instead can settle for a .tex file that has content that can be copied into the ACM template, we can live with a lot of errors in the template.

@agitter
Collaborator Author

agitter commented Apr 30, 2021

Edit: I realized every time I change the gist you need to update the metadata link to it.

That's not a problem for me. I'm editing these files regularly anyway. We can integrate it into this repo after we merge this pull request.

I'll work on the conflicts statement after pushing some text edits to #947.

Collaborator

@mprobson mprobson left a comment

This all looks good (pending conflicts integration). I'm not in a place to comment on build/update-latex-metadata.py but it seems to work.

Minor / Unimportant notes (please feel free to ignore - not blocking merge):

  • Can remove acmart.bib and sample-sigconf.tex (unused)
  • Can add (back) Makefile, bbx, cbx, dbx (but I'm not sure if we need?)

@mprobson
Collaborator

Just pushed a change to the gist enabling bibfile, which automatically adds the right bibliography line. Now we just need to automate removing the CSLReferences section.

@mprobson
Collaborator

I noticed we need the ACM copyright info. I can hardcode but the template could ingest the following (I'm just not sure where to check it in):

---
acm:
- copyrightyear: 2021
  copyright: acmcopyright
  conference: ACM-BCB '21
  conferencetitle: "ACM-BCB '21: ACM Conference on Bioinformatics, Computational Biology, and Health Informatics"
  date: August 01--04, 2021
  location: Online
...

@mprobson
Collaborator

Outstanding things on my TODO list (I think this all needs to be done manually?):

  • Move front matter into proper \begin{}...\end{} sections (may need to do by hand)
    • Abstract
    • Keywords
    • CCS
  • Remove section number from acknowledgements (again, may be by hand)

@rando2
Collaborator

rando2 commented Apr 30, 2021

@mprobson I usually manually change the acknowledgements, do you want me to send you the correct ones for this paper?

Edited to add: have to go into a meeting so doing this here just in case--

We are grateful to Josh Nicholson and Milo Mordaunt for their support with the scite plugin, and to David Nicholson for the suggestion and feedback to enable the reporting of the locations of spelling errors in the spell-checker tool. We thank Nick DeVito for assistance with the Evidence-Based Medicine Data Lab COVID-19 TrialsTracker data.

@agitter
Collaborator Author

agitter commented Apr 30, 2021

> I can hardcode but the template could ingest the following

I'll update to add that block into the YAML I generate.

> Minor / Unimportant notes

I can fix those.

@AppVeyorBot

AppVeyor build 1.0.4065 for commit f7780b8 is now complete.

Found 16 potential spelling error(s). Preview:
content/09.evolution.md:83:nonsynonymous
content/09.evolution.md:139:LVNA
content/23.vaccines-app.md:15:IgGs
content/23.vaccines-app.md:387:IgGs
content/60.methods.md:8:CCS
content/60.methods.md:65:ECRs
content/60.methods.md:81:Manubot's
content/60.methods.md:111:scite
content/60.methods.md:111:Scite
content/60.methods.md:147:scite
content/60.methods.md:164:ECR
content/60.methods.md:165:ECRs
content/60.methods.md:169:docx...
The rendered manuscript from this build is temporarily available for download at:

@mprobson
Collaborator

> I usually manually change the acknowledgements...

That's great, thanks! I think the bigger issue is that we need to use \section*{Acknowledgments} and I'm not sure how to automate that.

@mprobson mprobson mentioned this pull request Apr 30, 2021
2 tasks
@agitter
Collaborator Author

agitter commented Apr 30, 2021

I pushed changes that resolve most/all of the issues above:

  • clean up the included ACM files
  • only show the CCS concepts section when exporting as tex
  • add the ACM metadata and conflicts statement for the template
  • suppress the references section in the markdown
  • exclude 70.coi-contribs.md (conflicts table, contribution table, acknowledgements)
  • add a new markdown file for methods acknowledgements and make the section unnumbered

The acknowledgments section now shows up as

\hypertarget{acknowledgements}{%
\section*{Acknowledgements}\label{acknowledgements}}
\addcontentsline{toc}{section}{Acknowledgements}

@rando2 you can edit the content in 61.methods-ack.md if you need to edit the acknowledgements text for this individual manuscript.

I'm happy with this tex output. It's a lot cleaner than I would have guessed. From here, should we create a new issue with a single checklist of manual changes needed before submitting?

Latest version:
methods-manuscript.pdf
methods-manuscript.tex.txt

@agitter
Collaborator Author

agitter commented Apr 30, 2021

I just realized we don't have a funding statement anywhere. Should we add that to 61.methods-ack.md? I think that would be best in terms of adding it quickly and saving space.

@mprobson
Collaborator

mprobson commented Apr 30, 2021

Wow, that's amazing! I agree re: funding statement. I usually put my sources in the acknowledgements.

I'm currently getting a weird error trying to build the latest commits:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
KeyError: 'commit'

I suspect that something has gone wrong with my python but wanted to confirm it's working for you.

@agitter
Collaborator Author

agitter commented Apr 30, 2021

> I suspect that something has gone wrong with my python but wanted to confirm it's working for you.

Strange. It is working for me. Is that error coming from Manubot or my script?

@mprobson
Collaborator

It's definitely a system configuration issue on my end, although I am not sure what changed... I have some weird combination of Tmux, Conda, and MacOS... I re-ran on a known good commit and it still fails. conda deactivate/activate doesn't fix it nor does destroying and recreating the environment. I imagine it's just something I need to debug on my own.

Here's the command plus error:

(manubot) C02YD02QJGH8:covid19-review mrobso01$ BUILD_HTML=false BUILD_PDF=false BUILD_INDIVIDUAL=true build/build.sh
Traceback (most recent call last):
  File "<string>", line 1, in <module>
KeyError: 'commit'

@mprobson
Collaborator

mprobson commented Apr 30, 2021

I think I figured out what's happening... I seem to be dying on this line:

EXTERNAL_RESOURCES_COMMIT=$(curl -sS https://api.github.com/repos/greenelab/covid19-review/branches/external-resources | python -c "import sys, json; print(json.load(sys.stdin)['commit']['sha'])")

@rando2 and I share an IP address currently and she's getting rate limited by the GitHub API so I suspect I am now being throttled too...

Update: Connecting to my VPN resolves the issue!
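For anyone hitting the same KeyError: when the GitHub API rate limits a request, the JSON body carries a `message` field instead of `commit`, so the one-liner fails without saying why. A more defensive lookup could be sketched as below. The function names are hypothetical and this is not what build.sh currently does; only the URL is taken from the line above.

```python
import json
import urllib.error
import urllib.request

BRANCH_URL = (
    "https://api.github.com/repos/greenelab/covid19-review/"
    "branches/external-resources"
)


def extract_commit_sha(payload: dict) -> str:
    """Return the branch head SHA, or raise with the API's own message."""
    commit = payload.get("commit")
    if isinstance(commit, dict) and "sha" in commit:
        return commit["sha"]
    # Error responses (for example rate limiting) carry a "message" field.
    message = payload.get("message", "unexpected response")
    raise RuntimeError(f"GitHub API did not return a commit: {message}")


def fetch_commit_sha(url: str = BRANCH_URL) -> str:
    """Fetch the branch metadata and return its head commit SHA."""
    try:
        with urllib.request.urlopen(url) as response:
            payload = json.load(response)
    except urllib.error.HTTPError as error:
        # Rate-limited requests come back as HTTP errors with a JSON body.
        payload = json.load(error)
    return extract_commit_sha(payload)
```

With this, a throttled request would surface the rate limit message directly rather than a bare `KeyError: 'commit'`.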

@AppVeyorBot

AppVeyor build 1.0.4083 for commit 3baa06f is now complete.

Found 16 potential spelling error(s). Preview:
content/09.evolution.md:83:nonsynonymous
content/09.evolution.md:139:LVNA
content/23.vaccines-app.md:15:IgGs
content/23.vaccines-app.md:387:IgGs
content/60.methods.md:65:ECRs
content/60.methods.md:81:Manubot's
content/60.methods.md:111:scite
content/60.methods.md:111:Scite
content/60.methods.md:147:scite
content/60.methods.md:164:ECR
content/60.methods.md:165:ECRs
content/60.methods.md:169:docx
content/60.methods.md:171:s...
The rendered manuscript from this build is temporarily available for download at:
