Merge pull request #262 from marklogic/release/1.1.0
Merge release/1.1.0 into main
rjrudin authored Oct 2, 2024
2 parents 9f6a264 + 6378608 commit 6257b6c
Showing 53 changed files with 464 additions and 132 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -12,3 +12,4 @@ flux/conf
flux/export
export
flux-version.properties
docker/sonarqube
31 changes: 15 additions & 16 deletions CONTRIBUTING.md
@@ -1,10 +1,11 @@
To contribute to this project, complete these steps to set up a MarkLogic instance via Docker with a test
application installed:

1. Clone this repository if you have not already.
2. From the root directory of the project, run `docker-compose up -d --build`.
3. Wait 10 to 20 seconds and verify that <http://localhost:8001> shows the MarkLogic admin screen before proceeding.
4. Run `./gradlew -i mlDeploy` to deploy this project's test application (note that Java 11 or Java 17 is required).
1. Ensure you have Java 11 or higher installed; you will need Java 17 if you wish to use the SonarQube support described below.
2. Clone this repository if you have not already.
3. From the root directory of the project, run `docker-compose up -d --build`.
4. Wait 10 to 20 seconds and verify that <http://localhost:8001> shows the MarkLogic admin screen before proceeding.
5. Run `./gradlew -i mlDeploy` to deploy this project's test application.
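
Taken together, the setup amounts to a handful of shell commands. This is a rough sketch; the repository URL is assumed, so adjust it if you are working from a fork:

```
# Assumes the repository is hosted at github.com/marklogic/flux; adjust if needed.
git clone https://github.com/marklogic/flux.git
cd flux
docker-compose up -d --build
# Wait until http://localhost:8001 shows the MarkLogic admin screen, then:
./gradlew -i mlDeploy
```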

Some of the tests depend on the Postgres instance deployed via Docker. Follow these steps to load a sample dataset
into it:
@@ -57,10 +58,7 @@ publishing a local snapshot of our Spark connector. Then just run:

./gradlew clean test

You can run the tests using either Java 11 or Java 17.

In Intellij, the tests will run with Java 11. In order to run the tests in Intellij using Java 17,
perform the following steps:
If you are running the tests in Intellij with Java 17, you will need to perform the following steps:

1. Go to Run -> Edit Configurations in the Intellij toolbar.
2. Click on "Edit configuration templates".
@@ -81,7 +79,7 @@ delete that configuration first via the "Run -> Edit Configurations" panel.
## Generating code quality reports with SonarQube

In order to use SonarQube, you must have used Docker to run this project's `docker-compose.yml` file, and you must
have the services in that file running.
have the services in that file running. You must also use Java 17 to run the `sonar` Gradle task.

To configure the SonarQube service, perform the following steps:

@@ -97,8 +95,8 @@ To configure the SonarQube service, perform the following steps:
10. Add `systemProp.sonar.token=your token pasted here` to `gradle-local.properties` in the root of your project, creating
that file if it does not exist yet.
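
For example, `gradle-local.properties` would contain a single line such as the following, where the token value is just a placeholder for the token you generated in SonarQube:

```
systemProp.sonar.token=squ_0123456789abcdef0123456789abcdef01234567
```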

To run SonarQube, run the following Gradle tasks, which will run all the tests with code coverage and then generate
a quality report with SonarQube:
To run SonarQube, run the following Gradle tasks with Java 17 or higher, which will run all the tests with code
coverage and then generate a quality report with SonarQube:

./gradlew test sonar

@@ -116,7 +114,8 @@ before, then SonarQube will show "New Code" by default. That's handy, as you can
you've introduced on the feature branch you're working on. You can then click on "Overall Code" to see all issues.

Note that if you only need results on code smells and vulnerabilities, you can repeatedly run `./gradlew sonar`
without having to re-run the tests.
without having to re-run the tests. If you get an error from Sonar about Java sources, you just need to compile the
Java code, so run `./gradlew compileTestJava sonar`.

## Testing the documentation locally

@@ -229,7 +228,7 @@ Set `SPARK_HOME` to the location of Spark - e.g. `/Users/myname/.sdkman/candidat

Next, start a Spark master node:

cd $SPARK_HOME/bin
cd $SPARK_HOME/sbin
start-master.sh

You will need the address at which the Spark master node can be reached. To find it, open the log file that Spark
@@ -257,15 +256,15 @@ are all synonyms):

./gradlew shadowJar

This will produce an assembly jar at `./flux-cli/build/libs/marklogic-flux-1.0.0-all.jar`.
This will produce an assembly jar at `./flux-cli/build/libs/marklogic-flux-1.1.0-all.jar`.

You can now run any CLI command via spark-submit. This is an example of previewing an import of files - change the value
of `--path`, as an absolute path is needed, and of course change the value of `--master` to match that of your Spark
cluster:

```
$SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
--master spark://NYWHYC3G0W:7077 flux-cli/build/libs/marklogic-flux-1.0.0-all.jar \
--master spark://NYWHYC3G0W:7077 flux-cli/build/libs/marklogic-flux-1.1.0-all.jar \
import-files --path /Users/rudin/workspace/flux/flux-cli/src/test/resources/mixed-files \
--connection-string "admin:admin@localhost:8000" \
--preview 5 --preview-drop content
@@ -282,7 +281,7 @@ to something you can access):
$SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
--packages org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-client:3.3.4 \
--master spark://NYWHYC3G0W:7077 \
flux-cli/build/libs/marklogic-flux-1.0.0-all.jar \
flux-cli/build/libs/marklogic-flux-1.1.0-all.jar \
import-files --path "s3a://changeme/" \
--connection-string "admin:admin@localhost:8000" \
--s3-add-credentials \
24 changes: 12 additions & 12 deletions NOTICE.txt
@@ -6,13 +6,13 @@ To the extent required by the applicable open-source license, a complete machine

Third Party Notices

aws-java-sdk-s3 1.12.367 (Apache-2.0)
hadoop-aws 3.3.6 (Apache-2.0)
hadoop-client 3.3.6 (Apache-2.0)
marklogic-spark-connector 2.3.0 (Apache-2.0)
aws-java-sdk-s3 1.12.262 (Apache-2.0)
hadoop-aws 3.3.4 (Apache-2.0)
hadoop-client 3.3.4 (Apache-2.0)
marklogic-spark-connector 2.4.0 (Apache-2.0)
picocli 4.7.6 (Apache-2.0)
spark-avro_2.12 3.4.3 (Apache-2.0)
spark-sql_2.12 3.4.3 (Apache-2.0)
spark-avro_2.12 3.5.3 (Apache-2.0)
spark-sql_2.12 3.5.3 (Apache-2.0)

Common Licenses

@@ -22,32 +22,32 @@ Third-Party Components

The following is a list of the third-party components used by MarkLogic® Flux™ v1 (last updated July 2, 2024):

aws-java-sdk-s3 1.12.367 (Apache-2.0)
aws-java-sdk-s3 1.12.262 (Apache-2.0)
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)


hadoop-aws 3.3.6 (Apache-2.0)
hadoop-aws 3.3.4 (Apache-2.0)
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

hadoop-client 3.3.6 (Apache-2.0)
hadoop-client 3.3.4 (Apache-2.0)
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

marklogic-spark-connector 2.3 (Apache-2.0)
marklogic-spark-connector 2.4.0 (Apache-2.0)
https://repo1.maven.org/maven2/com/marklogic/marklogic-spark-connector
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

picocli 4.7.6 (Apache-2.0)
https://repo1.maven.org/maven2/info/picocli/picocli
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

spark-avro_2.12 3.4.3 (Apache-2.0)
spark-avro_2.12 3.5.3 (Apache-2.0)
https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

spark-sql_2.12 3.4.3 (Apache-2.0)
spark-sql_2.12 3.5.3 (Apache-2.0)
https://repo1.maven.org/maven2/org/apache/spark/spark-sql_2.12
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

2 changes: 1 addition & 1 deletion build.gradle
@@ -29,7 +29,7 @@ task gettingStartedZip(type: Zip) {
description = "Creates a zip of the getting-started project that is intended to be included as a downloadable file " +
"on the GitHub release page."
from "examples/getting-started"
exclude "build", ".gradle", "gradle-*.properties", "flux", ".gitignore"
exclude "build", ".gradle", "gradle-*.properties", "flux", ".gitignore", "marklogic-flux"
into "marklogic-flux-getting-started-${version}"
archiveFileName = "marklogic-flux-getting-started-${version}.zip"
destinationDirectory = file("build")
6 changes: 2 additions & 4 deletions docker-compose.yml
@@ -1,5 +1,3 @@
version: '3.8'

name: flux

services:
@@ -17,7 +15,7 @@ services:
- 8007:8007

marklogic:
image: "marklogicdb/marklogic-db:11.2.0-centos-1.1.2"
image: "progressofficial/marklogic-db:latest"
platform: linux/amd64
environment:
- MARKLOGIC_INIT=true
@@ -55,7 +53,7 @@ services:

# Copied from https://docs.sonarsource.com/sonarqube/latest/setup-and-upgrade/install-the-server/#example-docker-compose-configuration .
sonarqube:
image: sonarqube:10.3.0-community
image: sonarqube:10.6.0-community
depends_on:
- postgres
environment:
8 changes: 4 additions & 4 deletions docs/api.md
@@ -22,15 +22,15 @@ To add Flux as a dependency to your application, add the following to your Maven
<dependency>
<groupId>com.marklogic</groupId>
<artifactId>flux-api</artifactId>
<version>1.0.0</version>
<version>1.1.0</version>
</dependency>
```

Or if you are using Gradle, add the following to your `build.gradle` file:

```
dependencies {
implementation "com.marklogic:flux-api:1.0.0"
implementation "com.marklogic:flux-api:1.1.0"
}
```

@@ -97,7 +97,7 @@ buildscript {
mavenCentral()
}
dependencies {
classpath "com.marklogic:flux-api:1.0.0"
classpath "com.marklogic:flux-api:1.1.0"
}
}
```
@@ -139,7 +139,7 @@ buildscript {
mavenCentral()
}
dependencies {
classpath "com.marklogic:flux-api:1.0.0"
classpath "com.marklogic:flux-api:1.1.0"
classpath("com.marklogic:ml-gradle:4.8.0") {
exclude group: "com.fasterxml.jackson.databind"
exclude group: "com.fasterxml.jackson.core"
31 changes: 25 additions & 6 deletions docs/export/export-archives.md
@@ -57,21 +57,21 @@ combination of those options as well, with the exception that `--query` will be

You must then use the `--path` option to specify a directory to write archive files to.

### Windows-specific issues with zip files
### Windows-specific issues with ZIP files

In the likely event that you have one or more URIs with a forward slash - `/` - in them, then creating a zip file
In the likely event that you have one or more URIs with a forward slash - `/` - in them, then creating a ZIP file
with those URIs - which are used as the zip entry names - will produce confusing behavior on Windows. If you open the
zip file via Windows Explorer, Windows will erroneously think the zip file is empty. If you open the zip file using
ZIP file via Windows Explorer, Windows will erroneously think the file is empty. If you open the file using
7-Zip, you will see a top-level entry named `_` if one or more of your URIs begin with a forward slash. These are
effectively issues that only occur when viewing the file within Windows and do not reflect the actual contents of the
zip file. The contents of the file are correct and if you were to import them with Flux via the `import-archive-files`
ZIP file. The contents of the file are correct and if you were to import them with Flux via the `import-archive-files`
command, you will get the expected results.


## Controlling document metadata

Each exported document will have all of its associated metadata - collections, permissions, quality, properties, and
metadata values - included in an XML document in the archive zip file. You can control which types of metadata are
metadata values - included in an XML document in the archive ZIP file. You can control which types of metadata are
included with the `--categories` option. This option accepts a comma-delimited sequence of the following metadata types:

- `collections`
Expand Down Expand Up @@ -120,4 +120,23 @@ bin\flux export-archives ^
{% endtabs %}


The encoding will be used for both document and metadata entries in each archive zip file.
The encoding will be used for both document and metadata entries in each archive ZIP file.

## Exporting large binary files

Similar to [exporting large binary documents as files](export-documents.md), you can include large binary documents
in archives by including the `--streaming` option introduced in Flux 1.1.0. When this option is set, Flux will stream
each document from MarkLogic directly to a ZIP file, thereby avoiding reading the contents of a file into memory.

As streaming to an archive requires Flux to retrieve one document at a time from MarkLogic, you should not use this option
when exporting smaller documents that can easily fit into the memory available to Flux.

When using `--streaming`, the following options will behave in a different fashion:

- `--batch-size` will still affect how many URIs are retrieved from MarkLogic in a single request, but will not impact
the number of documents retrieved from MarkLogic in a single request, which will always be 1.
- `--encoding` will be ignored as applying an encoding requires reading the document into memory.
- `--pretty-print` will have no effect as the contents of a document will never be read into memory.

You typically will not want to use the `--transform` option as applying a REST transform in MarkLogic to a
large binary document may exhaust the amount of memory available to MarkLogic.
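
As a minimal sketch - the collection name is hypothetical and the connection string should match your environment - a streaming archive export might look like this:

```
./bin/flux export-archives \
  --connection-string "user:password@localhost:8000" \
  --collections binaries \
  --path ./archives \
  --streaming
```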
43 changes: 32 additions & 11 deletions docs/export/export-documents.md
@@ -157,22 +157,22 @@ To use the above transform, verify that your user has been granted the MarkLogic

## Compressing content

The `--compression` option is used to write files either to Gzip or ZIP files.
The `--compression` option is used to write files either to gzip or ZIP files.

To Gzip each file, include `--compression GZIP`.
To gzip each file, include `--compression GZIP`.

To write multiple files to one or more ZIP files, include `--compression ZIP`. A zip file will be created for each
To write multiple files to one or more ZIP files, include `--compression ZIP`. A ZIP file will be created for each
partition that was created when reading data via Optic. You can include `--zip-file-count 1` to force all documents to be
written to a single ZIP file. See the below section on "Understanding partitions" for more information.

### Windows-specific issues with zip files
### Windows-specific issues with ZIP files

In the likely event that you have one or more URIs with a forward slash - `/` - in them, then creating a zip file
In the likely event that you have one or more URIs with a forward slash - `/` - in them, then creating a ZIP file
with those URIs - which are used as the zip entry names - will produce confusing behavior on Windows. If you open the
zip file via Windows Explorer, Windows will erroneously think the zip file is empty. If you open the zip file using
ZIP file via Windows Explorer, Windows will erroneously think the file is empty. If you open the file using
7-Zip, you will see a top-level entry named `_` if one or more of your URIs begin with a forward slash. These are
effectively issues that only occur when viewing the file within Windows and do not reflect the actual contents of the
zip file. The contents of the file are correct and if you were to import them with Flux via the `import-files`
ZIP file. The contents of the file are correct and if you were to import them with Flux via the `import-files`
command, you will get the expected results.

## Specifying an encoding
@@ -202,6 +202,27 @@ bin\flux export-files ^
{% endtabs %}


## Exporting large binary documents

MarkLogic's [support for large binary documents](https://docs.marklogic.com/guide/app-dev/binaries#id_93203) allows
for storing binary files of any size. To ensure that large binary documents can be exported to a file path, consider
using the `--streaming` option introduced in Flux 1.1.0. When this option is set, Flux will stream each document
from MarkLogic directly to the file path, thereby avoiding reading the contents of a file into memory. This option
can be used when exporting documents to gzip or ZIP files as well via the `--compression zip` option.

As streaming to a file requires Flux to retrieve one document at a time from MarkLogic, you should not use this option
when exporting smaller documents that can easily fit into the memory available to Flux.

When using `--streaming`, the following options will behave in a different fashion:

- `--batch-size` will still affect how many URIs are retrieved from MarkLogic in a single request, but will not impact
the number of documents retrieved from MarkLogic in a single request, which will always be 1.
- `--encoding` will be ignored as applying an encoding requires reading the document into memory.
- `--pretty-print` will have no effect as the contents of a document will never be read into memory.

You typically will not want to use the `--transform` option as applying a REST transform in MarkLogic to a
large binary document may exhaust the amount of memory available to MarkLogic.
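
For example, the following sketch - the collection name and paths are placeholders - streams large binary documents into ZIP files:

```
./bin/flux export-files \
  --connection-string "user:password@localhost:8000" \
  --collections large-binaries \
  --path ./export \
  --compression zip \
  --streaming
```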

## Understanding partitions

As Flux is built on top of Apache Spark, it is heavily influenced by how Spark
@@ -237,9 +258,9 @@ bin\flux export-files ^
{% endtab %}
{% endtabs %}

The `./export` directory will have 12 zip files in it. This count is due to how Flux reads data from MarkLogic,
The `./export` directory will have 12 ZIP files in it. This count is due to how Flux reads data from MarkLogic,
which involves creating 4 partitions by default per forest in the MarkLogic database. The example application has 3
forests in its content database, and thus 12 partitions are created, resulting in 12 separate zip files.
forests in its content database, and thus 12 partitions are created, resulting in 12 separate ZIP files.

You can use the `--partitions-per-forest` option to control how many partitions - and thus workers - read documents
from each forest in your database:
@@ -272,7 +293,7 @@ bin\flux export-files ^
{% endtabs %}


This approach will produce 3 zip files - one per forest.
This approach will produce 3 ZIP files - one per forest.

You can also use the `--repartition` option, available on every command, to force the number of partitions used when
writing data, regardless of how many were used to read the data:
@@ -303,7 +324,7 @@ bin\flux export-files ^
{% endtabs %}


This approach will produce a single zip file due to the use of a single partition when writing files.
This approach will produce a single ZIP file due to the use of a single partition when writing files.
The `--zip-file-count` option is effectively an alias for `--repartition`. Both options produce the same outcome.
`--zip-file-count` is included as a more intuitive option for the common case of configuring how many files should
be written.
4 changes: 2 additions & 2 deletions docs/export/export-rdf.md
Original file line number Diff line number Diff line change
@@ -86,6 +86,6 @@ For some use cases involving exporting triples with their graphs to files contai
reference the graph that each triple belongs to in MarkLogic. You can use `--graph-override` to specify an alternative
graph value that will then be associated with every triple that Flux writes to a file.

## GZIP compression
## gzip compression

To compress each file written by Flux using GZIP, simply include `--gzip` as an option.
To compress each file written by Flux using gzip, simply include `--gzip` as an option.
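
For example - assuming the `--graphs` option for selecting triples, with a purely illustrative graph name and output path:

```
./bin/flux export-rdf \
  --connection-string "user:password@localhost:8000" \
  --graphs "http://example.org/my-graph" \
  --path ./rdf-export \
  --gzip
```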
4 changes: 4 additions & 0 deletions docs/export/export-rows.md
@@ -311,6 +311,10 @@ location where data already exists. This option supports the following values:

For convenience, the above values are case-insensitive so that you can ignore casing when choosing a value.

As of the 1.1.0 release of Flux, `--mode` defaults to `Append` for commands that write to a filesystem. In the 1.0.0
release, these commands defaulted to `Overwrite`. The `export-jdbc` command defaults to `ErrorIfExists` to avoid altering
an existing table in any way.
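
For example, to intentionally overwrite a directory of previously exported Parquet files - the Optic query and paths below are placeholders:

```
./bin/flux export-parquet-files \
  --connection-string "user:password@localhost:8000" \
  --query "op.fromView('example', 'employees')" \
  --path ./export/parquet \
  --mode Overwrite
```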

For further information on each mode, please see
[the Spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes).
