
unified-doc-cli for programmatic manipulation of any file on the web #1

Open
chrisrzhou opened this issue Aug 9, 2020 · 1 comment
Labels: help wanted, idea

Comments


chrisrzhou commented Aug 9, 2020

This idea will most likely be implemented in unified-doc-cli

Goals

The internet is, at its core, a collection of interconnected files. unified-doc aims to bridge working with different file types through unified document APIs. With a CLI implemented in unified-doc-cli, we can programmatically crawl/curl through web files and perform various useful processing on them, e.g.:

  • Searching for content
  • Sanitizing content
  • Extracting just the textContent (useful for NLP pipelines)
  • Parsing to hast and continuing content processing with hast utilities in the unified ecosystem
  • Outputting the source file in different formats (.html, .txt, and eventually .pdf, .docx, etc.)
  • Enriching the source file by attaching plugins, annotations, etc.

Config file

Maybe a .unirc.js file? This config provides the input for unified-doc, letting you attach or override the default parsers, plugins, and search algorithms.

// default config
module.exports = {}; // just that!

// custom config
module.exports = {
  parsers: {
    docx: myDocxParser,
  },
  compiler: myCompiler,
  sanitizeSchema: mySanitizeSchema,
  searchAlgorithm: mySearchAlgorithm,
};
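
To make overrides ergonomic, the CLI could merge the user's .unirc.js over its built-in defaults. A minimal sketch, assuming a shallow-merge strategy and default values that are my own placeholders (not unified-doc's actual defaults):

```javascript
// Hypothetical defaults; unified-doc's real built-ins may differ.
const defaults = {
  parsers: {},          // e.g. { docx: myDocxParser }
  compiler: null,       // null => fall back to the built-in compiler
  sanitizeSchema: null,
  searchAlgorithm: null,
};

// Shallow-merge a loaded .unirc.js config over the defaults, so users only
// specify what they want to override. Parser maps are merged key-by-key
// rather than replaced wholesale.
function loadConfig(userConfig = {}) {
  return {
    ...defaults,
    ...userConfig,
    parsers: { ...defaults.parsers, ...userConfig.parsers },
  };
}
```

A deep-merge library could be swapped in later if configs grow nested sections.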

CLI wrapper around API methods

The entry point for the CLI should be either:

  • a local filepath
  • a web URL
  • raw string data

From this entry point, we can determine the content and filename accordingly.
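
One way to do this determination is a small classifier over the positional argument. A sketch under my own assumptions (the heuristics and the `doc.txt` fallback filename are mine, not from unified-doc):

```javascript
// Classify the CLI's entry point as a web URL, a local filepath, or raw
// string content, and derive a filename that unified-doc can use to pick
// a parser.
function resolveEntry(input) {
  // Web URL: explicit http(s) scheme.
  if (/^https?:\/\//.test(input)) {
    const url = new URL(input);
    const basename = url.pathname.split('/').pop() || 'index.html';
    return { type: 'url', filename: basename };
  }
  // Local filepath: looks like a path and carries a file extension.
  if (/^[./~]|\//.test(input) && /\.\w+$/.test(input)) {
    return { type: 'filepath', filename: input.split('/').pop() };
  }
  // Otherwise, treat the argument as raw string content.
  return { type: 'string', filename: 'doc.txt' };
}
```

The url/filepath branches would then fetch or read the content before handing it to unified-doc; the string branch passes it through directly.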

CLI wrapper should intuitively wrap familiar API methods.

# output files (source, txt, html)
unified-doc https://some-webpage.html --file  # doc.file()
unified-doc https://some-webpage.html --file txt  # doc.file('.txt')
unified-doc https://some-webpage.html --file .html  # doc.file('.html')

# search file
unified-doc https://some-webpage.html --search 'spongebob'  --options ...  # doc.search('spongebob', options)

# text content
unified-doc https://some-webpage.html --text-content  # doc.textContent()

# parse hast
unified-doc https://some-webpage.html --parse  # doc.parse()
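
Internally, the flag-to-method mapping above could be a simple dispatch. A sketch assuming `doc` is a unified-doc instance exposing the `file()`, `search()`, `textContent()`, and `parse()` methods named in the comments (extension normalization and option parsing omitted):

```javascript
// Map each CLI flag to the unified-doc API method it wraps.
function run(doc, flag, arg) {
  switch (flag) {
    case '--file':
      return doc.file(arg); // arg is an optional extension, e.g. '.html'
    case '--search':
      return doc.search(arg); // arg is the search query
    case '--text-content':
      return doc.textContent();
    case '--parse':
      return doc.parse();
    default:
      throw new Error(`Unknown flag: ${flag}`);
  }
}
```

Keeping the dispatch this thin means the CLI stays a transparent wrapper: every flag corresponds one-to-one with a documented API method.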

Ideally, the CLI commands should be pipeable, allowing shell scripting. I'm not great with shell commands, but here is some pseudocode to demonstrate the idea:

unified-doc https://some-webpage.html --text-content > myfile.txt

# pipe search results back into the same file as annotations, then save the final HTML file
unified-doc https://some-webpage.html --search 'spongebob' >>> --annotate SEARCH_RESULTS >>> --file .html  # HTML file saved with annotations

Bulk processing

The CLI should define a way to specify a glob pattern of webpages, crawl through them, and bulk-process them, keeping track of errors and providing a way to access the processed files.
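
The error-tracking part might look like the sketch below, where `process` stands in for any per-file task (e.g. fetching a page and calling `doc.textContent()` on it) — the function names and result shape are my assumptions, not a proposed API:

```javascript
// Run a task over many entries sequentially, collecting successes and
// failures separately so a bulk crawl can report errors without aborting.
async function bulkProcess(entries, process) {
  const results = [];
  const errors = [];
  for (const entry of entries) {
    try {
      results.push({ entry, output: await process(entry) });
    } catch (error) {
      errors.push({ entry, message: error.message });
    }
  }
  return { results, errors };
}
```

A real implementation would likely add concurrency limits and retries, but the shape — partition outcomes into results and errors — is the key contract for bulk mode.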

@chrisrzhou (Member, Author) commented:

This part of the project excites me the most, given its immediate value once implemented.

Unfortunately, I have no experience writing CLI libraries. I'll be tackling this in the future as I ramp up my knowledge, but any help or advice from the community is greatly appreciated here.
