
unified-doc-cli for programmatic manipulation of any file on the web #1

Open
chrisrzhou opened this issue Aug 9, 2020 · 1 comment
Labels: help wanted, idea

Comments


chrisrzhou commented Aug 9, 2020

This idea will most likely be implemented in unified-doc-cli

Goals

The internet is, at its core, a collection of interconnected files. unified-doc aims to bridge working with different file types through unified document APIs. With a CLI implemented in unified-doc-cli, we can programmatically crawl/curl through web files and perform various useful processing on them, e.g.:

  • Searching for content
  • Sanitizing content
  • Extracting just the textContent (useful for NLP pipelines)
  • Parsing to hast and continuing content processing with hast utilities in the unified ecosystem
  • Outputting the source file in different formats (.html, .txt, and eventually .pdf, .docx, etc.)
  • Enriching the source file by attaching plugins, annotations, etc.

Config file

Maybe a .unirc.js file? This config provides the input for unified-doc, letting you attach or override the default parsers, plugins, and search algorithms.

// default config
module.exports = {}; // just that!

// custom config
module.exports = {
  parsers: {
    docx: myDocxParser,
  },
  compiler: myCompiler,
  sanitizeSchema: mySanitizeSchema,
  searchAlgorithm: mySearchAlgorithm,
};
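
To make overrides ergonomic, the CLI could merge the user's .unirc.js over its built-in defaults. A minimal sketch, assuming a shallow-merge strategy and default values that are my own placeholders (not unified-doc's actual defaults):

```javascript
// Hypothetical defaults; unified-doc's real built-ins may differ.
const defaults = {
  parsers: {},          // e.g. { docx: myDocxParser }
  compiler: null,       // null => fall back to the built-in compiler
  sanitizeSchema: null,
  searchAlgorithm: null,
};

// Shallow-merge a loaded .unirc.js config over the defaults, so users only
// specify what they want to override. Parser maps are merged key-by-key
// rather than replaced wholesale.
function loadConfig(userConfig = {}) {
  return {
    ...defaults,
    ...userConfig,
    parsers: { ...defaults.parsers, ...userConfig.parsers },
  };
}
```

A deep-merge library could be swapped in later if configs grow nested sections.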

CLI wrapper around API methods

The entry point for the CLI should be either:

  • a local filepath
  • a web URL
  • raw string data

From this entry point, we can determine the content and filename accordingly.
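
One way to do this determination is a small classifier over the positional argument. A sketch under my own assumptions (the heuristics and the `doc.txt` fallback filename are mine, not from unified-doc):

```javascript
// Classify the CLI's entry point as a web URL, a local filepath, or raw
// string content, and derive a filename that unified-doc can use to pick
// a parser.
function resolveEntry(input) {
  // Web URL: explicit http(s) scheme.
  if (/^https?:\/\//.test(input)) {
    const url = new URL(input);
    const basename = url.pathname.split('/').pop() || 'index.html';
    return { type: 'url', filename: basename };
  }
  // Local filepath: looks like a path and carries a file extension.
  if (/^[./~]|\//.test(input) && /\.\w+$/.test(input)) {
    return { type: 'filepath', filename: input.split('/').pop() };
  }
  // Otherwise, treat the argument as raw string content.
  return { type: 'string', filename: 'doc.txt' };
}
```

The url/filepath branches would then fetch or read the content before handing it to unified-doc; the string branch passes it through directly.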

CLI wrapper should intuitively wrap familiar API methods.

# output files (source, txt, html)
unified-doc https://some-webpage.html --file  # doc.file()
unified-doc https://some-webpage.html --file txt  # doc.file('.txt')
unified-doc https://some-webpage.html --file .html  # doc.file('.html')

# search file
unified-doc https://some-webpage.html --search 'spongebob'  --options ...  # doc.search('spongebob', options)

# text content
unified-doc https://some-webpage.html --text-content  # doc.textContent()

# parse hast
unified-doc https://some-webpage.html --parse  # doc.parse()
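
Internally, the flag-to-method mapping above could be a simple dispatch. A sketch assuming `doc` is a unified-doc instance exposing the `file()`, `search()`, `textContent()`, and `parse()` methods named in the comments (extension normalization and option parsing omitted):

```javascript
// Map each CLI flag to the unified-doc API method it wraps.
function run(doc, flag, arg) {
  switch (flag) {
    case '--file':
      return doc.file(arg); // arg is an optional extension, e.g. '.html'
    case '--search':
      return doc.search(arg); // arg is the search query
    case '--text-content':
      return doc.textContent();
    case '--parse':
      return doc.parse();
    default:
      throw new Error(`Unknown flag: ${flag}`);
  }
}
```

Keeping the dispatch this thin means the CLI stays a transparent wrapper: every flag corresponds one-to-one with a documented API method.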

Ideally, the CLI commands should be pipeable, allowing shell scripting. I'm not great with shell commands, but here is some pseudocode to demonstrate the idea:

unified-doc https://some-webpage.html --text-content > myfile.txt

# pipe search results back into the same file as annotations, then save the final HTML file
unified-doc https://some-webpage.html --search 'spongebob' >>> --annotate SEARCH_RESULTS >>> --file .html  # HTML file saved with annotations

Bulk processing

The CLI should define a way to specify a glob pattern of webpages, crawl through them, and bulk-process them, keeping track of errors and providing a way to access the processed files.
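
The error-tracking part might look like the sketch below, where `process` stands in for any per-file task (e.g. fetching a page and calling `doc.textContent()` on it) — the function names and result shape are my assumptions, not a proposed API:

```javascript
// Run a task over many entries sequentially, collecting successes and
// failures separately so a bulk crawl can report errors without aborting.
async function bulkProcess(entries, process) {
  const results = [];
  const errors = [];
  for (const entry of entries) {
    try {
      results.push({ entry, output: await process(entry) });
    } catch (error) {
      errors.push({ entry, message: error.message });
    }
  }
  return { results, errors };
}
```

A real implementation would likely add concurrency limits and retries, but the shape — partition outcomes into results and errors — is the key contract for bulk mode.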

@chrisrzhou (Member, Author) commented:

This part of the project excites me the most, given its immediate value once implemented.

Unfortunately, I have no experience writing CLI libraries. I'll be tackling this in the future as I ramp up my knowledge, but any help or advice from the community is greatly appreciated here.
