Merge pull request #262 from marklogic/release/1.1.0
Merge release/1.1.0 into main
rjrudin authored Oct 2, 2024
2 parents 9f6a264 + 6378608 commit 6257b6c
Showing 53 changed files with 464 additions and 132 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -12,3 +12,4 @@ flux/conf
flux/export
export
flux-version.properties
docker/sonarqube
31 changes: 15 additions & 16 deletions CONTRIBUTING.md
@@ -1,10 +1,11 @@
To contribute to this project, complete these steps to set up a MarkLogic instance via Docker with a test
application installed:

1. Clone this repository if you have not already.
2. From the root directory of the project, run `docker-compose up -d --build`.
3. Wait 10 to 20 seconds and verify that <http://localhost:8001> shows the MarkLogic admin screen before proceeding.
4. Run `./gradlew -i mlDeploy` to deploy this project's test application (note that Java 11 or Java 17 is required).
1. Ensure you have Java 11 or higher installed; you will need Java 17 if you wish to use the SonarQube support described below.
2. Clone this repository if you have not already.
3. From the root directory of the project, run `docker-compose up -d --build`.
4. Wait 10 to 20 seconds and verify that <http://localhost:8001> shows the MarkLogic admin screen before proceeding.
5. Run `./gradlew -i mlDeploy` to deploy this project's test application.
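
Taken together, the setup amounts to a handful of shell commands. This is a rough sketch; the repository URL is assumed, so adjust it if you are working from a fork:

```
# Assumes the repository is hosted at github.com/marklogic/flux; adjust if needed.
git clone https://github.com/marklogic/flux.git
cd flux
docker-compose up -d --build
# Wait until http://localhost:8001 shows the MarkLogic admin screen, then:
./gradlew -i mlDeploy
```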

Some of the tests depend on the Postgres instance deployed via Docker. Follow these steps to load a sample dataset
into it:
@@ -57,10 +58,7 @@ publishing a local snapshot of our Spark connector. Then just run:

./gradlew clean test

You can run the tests using either Java 11 or Java 17.

In Intellij, the tests will run with Java 11. In order to run the tests in Intellij using Java 17,
perform the following steps:
If you are running the tests in Intellij with Java 17, you will need to perform the following steps:

1. Go to Run -> Edit Configurations in the Intellij toolbar.
2. Click on "Edit configuration templates".
@@ -81,7 +79,7 @@ delete that configuration first via the "Run -> Edit Configurations" panel.
## Generating code quality reports with SonarQube

In order to use SonarQube, you must have used Docker to run this project's `docker-compose.yml` file, and you must
have the services in that file running.
have the services in that file running. You must also use Java 17 to run the `sonar` Gradle task.

To configure the SonarQube service, perform the following steps:

@@ -97,8 +95,8 @@ To configure the SonarQube service, perform the following steps:
10. Add `systemProp.sonar.token=your token pasted here` to `gradle-local.properties` in the root of your project, creating
that file if it does not exist yet.
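
For example, `gradle-local.properties` would contain a single line such as the following, where the token value is just a placeholder for the token you generated in SonarQube:

```
systemProp.sonar.token=squ_0123456789abcdef0123456789abcdef01234567
```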

To run SonarQube, run the following Gradle tasks, which will run all the tests with code coverage and then generate
a quality report with SonarQube:
To run SonarQube, run the following Gradle tasks with Java 17 or higher, which will run all the tests with code
coverage and then generate a quality report with SonarQube:

./gradlew test sonar

@@ -116,7 +114,8 @@ before, then SonarQube will show "New Code" by default. That's handy, as you can
you've introduced on the feature branch you're working on. You can then click on "Overall Code" to see all issues.

Note that if you only need results on code smells and vulnerabilities, you can repeatedly run `./gradlew sonar`
without having to re-run the tests.
without having to re-run the tests. If you get an error from Sonar about Java sources, you just need to compile the
Java code, so run `./gradlew compileTestJava sonar`.

## Testing the documentation locally

@@ -229,7 +228,7 @@ Set `SPARK_HOME` to the location of Spark - e.g. `/Users/myname/.sdkman/candidat

Next, start a Spark master node:

cd $SPARK_HOME/bin
cd $SPARK_HOME/sbin
start-master.sh

You will need the address at which the Spark master node can be reached. To find it, open the log file that Spark
@@ -257,15 +256,15 @@ are all synonyms):

./gradlew shadowJar

This will produce an assembly jar at `./flux-cli/build/libs/marklogic-flux-1.0.0-all.jar`.
This will produce an assembly jar at `./flux-cli/build/libs/marklogic-flux-1.1.0-all.jar`.

You can now run any CLI command via spark-submit. This is an example of previewing an import of files - change the value
of `--path`, as an absolute path is needed, and of course change the value of `--master` to match that of your Spark
cluster:

```
$SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
--master spark://NYWHYC3G0W:7077 flux-cli/build/libs/marklogic-flux-1.0.0-all.jar \
--master spark://NYWHYC3G0W:7077 flux-cli/build/libs/marklogic-flux-1.1.0-all.jar \
import-files --path /Users/rudin/workspace/flux/flux-cli/src/test/resources/mixed-files \
--connection-string "admin:admin@localhost:8000" \
--preview 5 --preview-drop content
@@ -282,7 +281,7 @@ to something you can access):
$SPARK_HOME/bin/spark-submit --class com.marklogic.flux.spark.Submit \
--packages org.apache.hadoop:hadoop-aws:3.3.4,org.apache.hadoop:hadoop-client:3.3.4 \
--master spark://NYWHYC3G0W:7077 \
flux-cli/build/libs/marklogic-flux-1.0.0-all.jar \
flux-cli/build/libs/marklogic-flux-1.1.0-all.jar \
import-files --path "s3a://changeme/" \
--connection-string "admin:admin@localhost:8000" \
--s3-add-credentials \
24 changes: 12 additions & 12 deletions NOTICE.txt
@@ -6,13 +6,13 @@ To the extent required by the applicable open-source license, a complete machine

Third Party Notices

aws-java-sdk-s3 1.12.367 (Apache-2.0)
hadoop-aws 3.3.6 (Apache-2.0)
hadoop-client 3.3.6 (Apache-2.0)
marklogic-spark-connector 2.3.0 (Apache-2.0)
aws-java-sdk-s3 1.12.262 (Apache-2.0)
hadoop-aws 3.3.4 (Apache-2.0)
hadoop-client 3.3.4 (Apache-2.0)
marklogic-spark-connector 2.4.0 (Apache-2.0)
picocli 4.7.6 (Apache-2.0)
spark-avro_2.12 3.4.3 (Apache-2.0)
spark-sql_2.12 3.4.3 (Apache-2.0)
spark-avro_2.12 3.5.3 (Apache-2.0)
spark-sql_2.12 3.5.3 (Apache-2.0)

Common Licenses

@@ -22,32 +22,32 @@ Third-Party Components

The following is a list of the third-party components used by MarkLogic® Flux™ v1 (last updated July 2, 2024):

aws-java-sdk-s3 1.12.367 (Apache-2.0)
aws-java-sdk-s3 1.12.262 (Apache-2.0)
https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-s3
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)


hadoop-aws 3.3.6 (Apache-2.0)
hadoop-aws 3.3.4 (Apache-2.0)
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

hadoop-client 3.3.6 (Apache-2.0)
hadoop-client 3.3.4 (Apache-2.0)
https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

marklogic-spark-connector 2.3 (Apache-2.0)
marklogic-spark-connector 2.4.0 (Apache-2.0)
https://repo1.maven.org/maven2/com/marklogic/marklogic-spark-connector
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

picocli 4.7.6 (Apache-2.0)
https://repo1.maven.org/maven2/info/picocli/picocli
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

spark-avro_2.12 3.4.3 (Apache-2.0)
spark-avro_2.12 3.5.3 (Apache-2.0)
https://repo1.maven.org/maven2/org/apache/spark/spark-avro_2.12
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

spark-sql_2.12 3.4.3 (Apache-2.0)
spark-sql_2.12 3.5.3 (Apache-2.0)
https://repo1.maven.org/maven2/org/apache/spark/spark-sql_2.12
For the full text of the Apache-2.0 license, see Apache License 2.0 (Apache-2.0)

2 changes: 1 addition & 1 deletion build.gradle
@@ -29,7 +29,7 @@ task gettingStartedZip(type: Zip) {
description = "Creates a zip of the getting-started project that is intended to be included as a downloadable file " +
"on the GitHub release page."
from "examples/getting-started"
exclude "build", ".gradle", "gradle-*.properties", "flux", ".gitignore"
exclude "build", ".gradle", "gradle-*.properties", "flux", ".gitignore", "marklogic-flux"
into "marklogic-flux-getting-started-${version}"
archiveFileName = "marklogic-flux-getting-started-${version}.zip"
destinationDirectory = file("build")
6 changes: 2 additions & 4 deletions docker-compose.yml
@@ -1,5 +1,3 @@
version: '3.8'

name: flux

services:
@@ -17,7 +15,7 @@ services:
- 8007:8007

marklogic:
image: "marklogicdb/marklogic-db:11.2.0-centos-1.1.2"
image: "progressofficial/marklogic-db:latest"
platform: linux/amd64
environment:
- MARKLOGIC_INIT=true
@@ -55,7 +53,7 @@ services:

# Copied from https://docs.sonarsource.com/sonarqube/latest/setup-and-upgrade/install-the-server/#example-docker-compose-configuration .
sonarqube:
image: sonarqube:10.3.0-community
image: sonarqube:10.6.0-community
depends_on:
- postgres
environment:
8 changes: 4 additions & 4 deletions docs/api.md
@@ -22,15 +22,15 @@ To add Flux as a dependency to your application, add the following to your Maven
<dependency>
<groupId>com.marklogic</groupId>
<artifactId>flux-api</artifactId>
<version>1.0.0</version>
<version>1.1.0</version>
</dependency>
```

Or if you are using Gradle, add the following to your `build.gradle` file:

```
dependencies {
implementation "com.marklogic:flux-api:1.0.0"
implementation "com.marklogic:flux-api:1.1.0"
}
```

@@ -97,7 +97,7 @@ buildscript {
mavenCentral()
}
dependencies {
classpath "com.marklogic:flux-api:1.0.0"
classpath "com.marklogic:flux-api:1.1.0"
}
}
```
@@ -139,7 +139,7 @@ buildscript {
mavenCentral()
}
dependencies {
classpath "com.marklogic:flux-api:1.0.0"
classpath "com.marklogic:flux-api:1.1.0"
classpath("com.marklogic:ml-gradle:4.8.0") {
exclude group: "com.fasterxml.jackson.databind"
exclude group: "com.fasterxml.jackson.core"
31 changes: 25 additions & 6 deletions docs/export/export-archives.md
@@ -57,21 +57,21 @@ combination of those options as well, with the exception that `--query` will be

You must then use the `--path` option to specify a directory to write archive files to.

### Windows-specific issues with zip files
### Windows-specific issues with ZIP files

In the likely event that you have one or more URIs with a forward slash - `/` - in them, then creating a zip file
In the likely event that you have one or more URIs with a forward slash - `/` - in them, then creating a ZIP file
with those URIs - which are used as the zip entry names - will produce confusing behavior on Windows. If you open the
zip file via Windows Explorer, Windows will erroneously think the zip file is empty. If you open the zip file using
ZIP file via Windows Explorer, Windows will erroneously think the file is empty. If you open the file using
7-Zip, you will see a top-level entry named `_` if one or more of your URIs begin with a forward slash. These are
effectively issues that only occur when viewing the file within Windows and do not reflect the actual contents of the
zip file. The contents of the file are correct and if you were to import them with Flux via the `import-archive-files`
ZIP file. The contents of the file are correct and if you were to import them with Flux via the `import-archive-files`
command, you will get the expected results.


## Controlling document metadata

Each exported document will have all of its associated metadata - collections, permissions, quality, properties, and
metadata values - included in an XML document in the archive zip file. You can control which types of metadata are
metadata values - included in an XML document in the archive ZIP file. You can control which types of metadata are
included with the `--categories` option. This option accepts a comma-delimited sequence of the following metadata types:

- `collections`
Expand Down Expand Up @@ -120,4 +120,23 @@ bin\flux export-archives ^
{% endtabs %}


The encoding will be used for both document and metadata entries in each archive zip file.
The encoding will be used for both document and metadata entries in each archive ZIP file.

## Exporting large binary files

Similar to [exporting large binary documents as files](export-documents.md), you can include large binary documents
in archives by including the `--streaming` option introduced in Flux 1.1.0. When this option is set, Flux will stream
each document from MarkLogic directly to a ZIP file, thereby avoiding reading the contents of a file into memory.

As streaming to an archive requires Flux to retrieve one document at a time from MarkLogic, you should not use this option
when exporting smaller documents that can easily fit into the memory available to Flux.

When using `--streaming`, the following options will behave in a different fashion:

- `--batch-size` will still affect how many URIs are retrieved from MarkLogic in a single request, but will not impact
the number of documents retrieved from MarkLogic in a single request, which will always be 1.
- `--encoding` will be ignored as applying an encoding requires reading the document into memory.
- `--pretty-print` will have no effect as the contents of a document will never be read into memory.

You typically will not want to use the `--transform` option as applying a REST transform in MarkLogic to a
large binary document may exhaust the amount of memory available to MarkLogic.
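
As a minimal sketch - the collection name is hypothetical and the connection string should match your environment - a streaming archive export might look like this:

```
./bin/flux export-archives \
  --connection-string "user:password@localhost:8000" \
  --collections binaries \
  --path ./archives \
  --streaming
```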
43 changes: 32 additions & 11 deletions docs/export/export-documents.md
@@ -157,22 +157,22 @@ To use the above transform, verify that your user has been granted the MarkLogic

## Compressing content

The `--compression` option is used to write files either to Gzip or ZIP files.
The `--compression` option is used to write files either to gzip or ZIP files.

To Gzip each file, include `--compression GZIP`.
To gzip each file, include `--compression GZIP`.

To write multiple files to one or more ZIP files, include `--compression ZIP`. A zip file will be created for each
To write multiple files to one or more ZIP files, include `--compression ZIP`. A ZIP file will be created for each
partition that was created when reading data via Optic. You can include `--zip-file-count 1` to force all documents to be
written to a single ZIP file. See the below section on "Understanding partitions" for more information.

### Windows-specific issues with zip files
### Windows-specific issues with ZIP files

In the likely event that you have one or more URIs with a forward slash - `/` - in them, then creating a zip file
In the likely event that you have one or more URIs with a forward slash - `/` - in them, then creating a ZIP file
with those URIs - which are used as the zip entry names - will produce confusing behavior on Windows. If you open the
zip file via Windows Explorer, Windows will erroneously think the zip file is empty. If you open the zip file using
ZIP file via Windows Explorer, Windows will erroneously think the file is empty. If you open the file using
7-Zip, you will see a top-level entry named `_` if one or more of your URIs begin with a forward slash. These are
effectively issues that only occur when viewing the file within Windows and do not reflect the actual contents of the
zip file. The contents of the file are correct and if you were to import them with Flux via the `import-files`
ZIP file. The contents of the file are correct and if you were to import them with Flux via the `import-files`
command, you will get the expected results.

## Specifying an encoding
@@ -202,6 +202,27 @@ bin\flux export-files ^
{% endtabs %}


## Exporting large binary documents

MarkLogic's [support for large binary documents](https://docs.marklogic.com/guide/app-dev/binaries#id_93203) allows
for storing binary files of any size. To ensure that large binary documents can be exported to a file path, consider
using the `--streaming` option introduced in Flux 1.1.0. When this option is set, Flux will stream each document
from MarkLogic directly to the file path, thereby avoiding reading the contents of a file into memory. This option
can be used when exporting documents to gzip or ZIP files as well via the `--compression zip` option.

As streaming to a file requires Flux to retrieve one document at a time from MarkLogic, you should not use this option
when exporting smaller documents that can easily fit into the memory available to Flux.

When using `--streaming`, the following options will behave in a different fashion:

- `--batch-size` will still affect how many URIs are retrieved from MarkLogic in a single request, but will not impact
the number of documents retrieved from MarkLogic in a single request, which will always be 1.
- `--encoding` will be ignored as applying an encoding requires reading the document into memory.
- `--pretty-print` will have no effect as the contents of a document will never be read into memory.

You typically will not want to use the `--transform` option as applying a REST transform in MarkLogic to a
large binary document may exhaust the amount of memory available to MarkLogic.
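
For example, the following sketch - the collection name and paths are placeholders - streams large binary documents into ZIP files:

```
./bin/flux export-files \
  --connection-string "user:password@localhost:8000" \
  --collections large-binaries \
  --path ./export \
  --compression zip \
  --streaming
```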

## Understanding partitions

As Flux is built on top of Apache Spark, it is heavily influenced by how Spark
@@ -237,9 +258,9 @@ bin\flux export-files ^
{% endtab %}
{% endtabs %}

The `./export` directory will have 12 zip files in it. This count is due to how Flux reads data from MarkLogic,
The `./export` directory will have 12 ZIP files in it. This count is due to how Flux reads data from MarkLogic,
which involves creating 4 partitions by default per forest in the MarkLogic database. The example application has 3
forests in its content database, and thus 12 partitions are created, resulting in 12 separate zip files.
forests in its content database, and thus 12 partitions are created, resulting in 12 separate ZIP files.

You can use the `--partitions-per-forest` option to control how many partitions - and thus workers - read documents
from each forest in your database:
@@ -272,7 +293,7 @@ bin\flux export-files ^
{% endtabs %}


This approach will produce 3 zip files - one per forest.
This approach will produce 3 ZIP files - one per forest.

You can also use the `--repartition` option, available on every command, to force the number of partitions used when
writing data, regardless of how many were used to read the data:
@@ -303,7 +324,7 @@ bin\flux export-files ^
{% endtabs %}


This approach will produce a single zip file due to the use of a single partition when writing files.
This approach will produce a single ZIP file due to the use of a single partition when writing files.
The `--zip-file-count` option is effectively an alias for `--repartition`. Both options produce the same outcome.
`--zip-file-count` is included as a more intuitive option for the common case of configuring how many files should
be written.
4 changes: 2 additions & 2 deletions docs/export/export-rdf.md
Original file line number Diff line number Diff line change
@@ -86,6 +86,6 @@ For some use cases involving exporting triples with their graphs to files contai
reference the graph that each triple belongs to in MarkLogic. You can use `--graph-override` to specify an alternative
graph value that will then be associated with every triple that Flux writes to a file.

## GZIP compression
## gzip compression

To compress each file written by Flux using GZIP, simply include `--gzip` as an option.
To compress each file written by Flux using gzip, simply include `--gzip` as an option.
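
For example - assuming the `--graphs` option for selecting triples, with a purely illustrative graph name and output path:

```
./bin/flux export-rdf \
  --connection-string "user:password@localhost:8000" \
  --graphs "http://example.org/my-graph" \
  --path ./rdf-export \
  --gzip
```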
4 changes: 4 additions & 0 deletions docs/export/export-rows.md
@@ -311,6 +311,10 @@ location where data already exists. This option supports the following values:

For convenience, the above values are case-insensitive so that you can ignore casing when choosing a value.

As of the 1.1.0 release of Flux, `--mode` defaults to `Append` for commands that write to a filesystem. In the 1.0.0
release, these commands defaulted to `Overwrite`. The `export-jdbc` command defaults to `ErrorIfExists` to avoid altering
an existing table in any way.
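
For example, to intentionally overwrite a directory of previously exported Parquet files - the Optic query and paths below are placeholders:

```
./bin/flux export-parquet-files \
  --connection-string "user:password@localhost:8000" \
  --query "op.fromView('example', 'employees')" \
  --path ./export/parquet \
  --mode Overwrite
```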

For further information on each mode, please see
[the Spark documentation](https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#save-modes).
