Release v1.2.0 · usc-softarch/arcade_core

DISCLAIMER

Please note that ARCADE_Core is an experimental tool. If you find any bugs or have difficulty at any point in the execution, please contact me (Marcelo) through the link provided in the readme.md.

What is included in this release

ARCADE_Core.jar : A packaged distribution of ARCADE_Core, including all of its dependencies.
Mallet-202108.zip : A distribution of Mallet, which must be unpacked to be used as a fact extractor for ARC.
resources.zip : A directory containing resources used by certain phases of ARC, which should be unpacked in the directory from which ARCADE_Core.jar is run. It may technically be unpacked elsewhere, but this may cause warnings to be raised during execution.
code-maat-1.0-SNAPSHOT-standalone.jar : A distribution of Code Maat, to be used as a fact extractor.
pmd-bin-5.3.2.zip : A distribution of PMD, to be used as a fact extractor.
apache-ant-1.9.6.zip : A distribution of Apache Ant, to be used in executing PMD.
DependencyFinder.zip : A distribution of Dependency Finder, to be used as a fact extractor.
mkdep.pl and mkfiles.pl : Two Perl scripts that are executed when analyzing C-based systems. They must be placed in the same directory as ARCADE_Core itself.

Requirements

To analyze C-based systems, you will need a working installation of Perl. To use ARC on Windows, you will also need to set an environment variable MALLET_HOME pointing to the root of the Mallet distribution provided with ARCADE_Core.

Compiling ARCADE

While a binary is provided, re-compiling is easily achieved using Maven and a JDK 11+. To generate a jar, run the command mvn clean to install ARCADE_Core's local dependencies to your .m2 directory, and then mvn package -Dmaven.test.skip=true to generate the jar inside the target directory.

Known issues

The DependencyFinderProcessing smell detector for Interface-based and Change-based smells only works with Java-based systems. This is due to its dependence on Dependency Finder, a dependency extractor for Java-based systems. Therefore, all fact extraction components related to this smell detector, viz. Code Maat, CleanUpCodeMaat, PMD and Dependency Finder, are not useful when analyzing C-based systems.
Due to limitations in the current version of ARCADE_Core, the source directories MUST follow a specific naming convention. Version directories must be named <PROJECT_NAME>-<PROJECT_VERSION>, whereas a root directory containing multiple versions should NOT contain the dash character. For example, if the system under analysis is bash and the versions are 1.14.4 and 4.2, then the directory structure should be:

bash
|- bash-1.14.4
|- bash-4.2
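
The convention can be checked before running any analysis. The following is a minimal sketch and not part of ARCADE_Core: check_layout is a hypothetical helper that takes the root directory, rejects a root name containing a dash, and reports any subdirectory whose name does not start with the root name followed by a dash.

```shell
# Hypothetical helper, not part of ARCADE_Core: verifies the naming
# convention described above for a root directory of versions.
check_layout() {
  root=${1%/}                 # drop a trailing slash, if any
  name=${root##*/}            # last path component, e.g. "bash"
  case $name in
    *-*) echo "root directory '$name' must not contain a dash"; return 1 ;;
  esac
  status=0
  for dir in "$root"/*/; do
    [ -d "$dir" ] || continue # the glob matched nothing
    version=${dir%/}
    version=${version##*/}    # e.g. "bash-4.2"
    case $version in
      "$name"-?*) ;;          # matches <PROJECT_NAME>-<PROJECT_VERSION>
      *) echo "unexpected version directory: $version"; status=1 ;;
    esac
  done
  return $status
}
```

Calling check_layout bash on the structure above returns successfully and prints nothing.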

Functionalities

Fact Extraction

CSourceToDepsBuilder: This is the Fact Extractor for C-based systems. It takes the system source as input and outputs a list of dependencies in .RSF format and a serialized FeatureVectors file, which is used as input for later phases. Remember that this component will fail if mkdep.pl and mkfiles.pl (included in this release) are not placed in the same directory as ARCADE_Core's executable jar.

java -cp ARCADE_Core.jar edu.usc.softarch.arcade.facts.dependencies.CSourceToDepsBuilder <PATH_TO_SOURCE> <PATH_TO_RSF_OUTPUT> <PATH_TO_FVECS_OUTPUT>

<PATH_TO_SOURCE> : This should be a path pointing to the root of the system under analysis, so that CSourceToDepsBuilder may locate all source files by searching that directory's subtree.

<PATH_TO_RSF_OUTPUT> : This is the path where the output .RSF should be placed. Necessary directories will be created as long as ARCADE_Core has access permissions to the root directory. The path should include the desired filename.

<PATH_TO_FVECS_OUTPUT> : This is the path where the output .JSON should be placed. Necessary directories will be created as long as ARCADE_Core has access permissions to the root directory. The path should include the desired filename.

Note that this component outputs two result files. While equivalent, they are used by different components of ARCADE: the RSF output is most widely used and is the most human-readable format of the two; the JSON output contains a few extra bits of information which are needed by the Clusterer component, which includes ARC, WCA and Limbo.

JavaSourceToDepsBuilder: This is the Fact Extractor for Java systems, and works similarly to CSourceToDepsBuilder.

java -cp ARCADE_Core.jar edu.usc.softarch.arcade.facts.dependencies.JavaSourceToDepsBuilder <PATH_TO_BINARIES> <PATH_TO_RSF_OUTPUT> <PATH_TO_FVECS_OUTPUT> <PACKAGE_PREFIX>

<PATH_TO_BINARIES> : This should be a path pointing to a directory containing the compiled binaries of the system under analysis. Do note that providing the root of the system under analysis, such as in CSourceToDepsBuilder, will result in an empty output.

<PACKAGE_PREFIX> : A prefix by which to filter the dependencies to only those relevant to the subject system. The fact extractor will omit dependencies from the output unless their package begins with <PACKAGE_PREFIX>. Using the empty string as an argument includes all results in the output, and may improve the results of the clustering phase. To do so, input "" as the final argument.
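
The quotes matter because the empty prefix has to reach the program as an argument of its own. The sketch below illustrates this with a hypothetical helper, count_args, which only reports how many arguments the shell passed to it; the paths are made-up examples.

```shell
# Hypothetical helper: prints how many arguments it received.
count_args() { echo "$#"; }

# With the quotes, the empty prefix is passed as a fourth (empty) argument.
count_args build/classes out/deps.rsf out/fvecs.json ""   # prints 4

# Without them, only three arguments are passed and the prefix is missing.
count_args build/classes out/deps.rsf out/fvecs.json      # prints 3
```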

MalletRunner: This is a driver for executing Mallet correctly for generating ARC's inputs.

java -cp ARCADE_Core.jar edu.usc.softarch.arcade.topics.MalletRunner <PATH_TO_SOURCE> <SOURCE_LANGUAGE> <MALLET_PATH> <ARTIFACTS_DIR> <STOPWORD_DIR>

<PATH_TO_SOURCE>: This should point to the root directory to be analyzed. That is, either the source directory for a single version or a super-directory containing the source directories of multiple versions.

<SOURCE_LANGUAGE>: Language of the system under analysis, either c or java. Case-insensitive.

<MALLET_PATH>: Path to the Mallet executable file, either Mallet-202108\bin\mallet.bat on Windows or Mallet-202108/bin/mallet on Linux.

<ARTIFACTS_DIR>: Directory in which to place the output.

<STOPWORD_DIR>: Directory containing javakeywordsexpanded and ckeywordsexpanded. In this distribution, these are contained in the resources.zip, under res.

By default, this driver will execute by creating a temporary copy of the source directory containing only the files of interest, and then removing that copy once execution is finished. This is due to a limitation of Mallet which does not allow the user to specify which file extensions are to be included in the analysis. For that reason, running the driver on very large systems may take a while. In order to potentially speed things up, two optional arguments may be added to the command:

java -cp ARCADE_Core.jar edu.usc.softarch.arcade.topics.MalletRunner <PATH_TO_SOURCE> <SOURCE_LANGUAGE> <MALLET_PATH> <ARTIFACTS_DIR> <STOPWORD_DIR> <COPY_READY> <KEEP_COPY>

<COPY_READY>: Boolean true or false, lower case. Indicates that the copy directory <PATH_TO_SOURCE>_temp already exists. If this argument is set to false or is absent and a directory already exists under the name <PATH_TO_SOURCE>_temp, execution will fail. This is to avoid unwanted over-writing of existing files. This can happen if execution fails for any reason before it has a chance to remove the temporary copy.

<KEEP_COPY>: Boolean true or false, lower case. Indicates that the temporary copy directory should not be removed. Can be useful if the user intends to run multiple Mallet analyses over the same source.
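
Whether a copy is already present can be checked before launching the driver. The following is a minimal sketch: copy_ready_flag is a hypothetical helper that prints the value to pass as <COPY_READY>. Note that a copy left behind by a failed run may be incomplete, so deleting it and passing false may be the safer choice in that case.

```shell
# Hypothetical pre-flight helper: prints the value to pass as <COPY_READY>,
# based on whether <PATH_TO_SOURCE>_temp already exists.
copy_ready_flag() {
  src=${1%/}                # ignore a trailing slash
  if [ -d "${src}_temp" ]; then
    echo true               # a copy exists; passing false would make execution fail
  else
    echo false              # no copy yet; the driver creates one
  fi
}
```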

Lastly, by default, the execution is set to run for 50 topics and 250 iterations. This can be modified by adding another two optional arguments:

java -cp ARCADE_Core.jar edu.usc.softarch.arcade.topics.MalletRunner <PATH_TO_SOURCE> <SOURCE_LANGUAGE> <MALLET_PATH> <ARTIFACTS_DIR> <STOPWORD_DIR> <COPY_READY> <KEEP_COPY> <NUM_TOPICS> <NUM_ITERATIONS>
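
Since the arguments are positional as shown, changing the number of topics or iterations also requires supplying <COPY_READY> and <KEEP_COPY>. The sketch below is a hypothetical dry-run wrapper, mallet_command, which only prints the resulting command line and fills in the defaults described above for any optional value left out.

```shell
# Hypothetical wrapper (dry run): prints the MalletRunner command line,
# filling in the documented defaults when optional values are not supplied.
mallet_command() {
  src=$1 lang=$2 mallet=$3 artifacts=$4 stopwords=$5
  copy_ready=${6:-false}
  keep_copy=${7:-false}
  topics=${8:-50}
  iterations=${9:-250}
  echo java -cp ARCADE_Core.jar edu.usc.softarch.arcade.topics.MalletRunner \
    "$src" "$lang" "$mallet" "$artifacts" "$stopwords" \
    "$copy_ready" "$keep_copy" "$topics" "$iterations"
}

# Example with made-up paths: 100 topics and 500 iterations.
mallet_command bash/bash-4.2 c Mallet-202108/bin/mallet artifacts res false false 100 500
```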

Code Maat: Code Maat is used to collect additional coupling information used by one of ARCADE's smell detectors. In order to run it, a few preliminary steps are required.

git log --all --numstat --date=short --pretty=format:'--%h--%ad--%aN' --no-renames > <ARTIFACTS_DIR>/project.log

<ARTIFACTS_DIR> : The artifacts directory. Note that project.log does not necessarily need to be placed in the artifacts directory, or indeed be named project.log, but we find that keeping fact extractor results in one place facilitates use of ARCADE.

This command should be run from the root of the system under analysis, so as to obtain the version log from git.

sed "s/'//g" <ARTIFACTS_DIR>/project.log > <ARTIFACTS_DIR>/clean_project.log

Likewise, naming and placement of these files can be changed. This command is intended to pre-process the log file, removing all single quotes. As sed may not be available in Windows, other similar means of removing single quotes may be used instead, such as a manual replacement using a text editor.

Finally, execute the included distribution of Code Maat to obtain the project.csv used by the DependencyFinderProcessing smell detector.

java -jar code-maat-1.0-SNAPSHOT-standalone.jar -l <ARTIFACTS_DIR>/clean_project.log -c git2 -a coupling > <ARTIFACTS_DIR>/project.csv
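
The cleaning step can be tried in isolation. The sketch below wraps the sed expression from above in a function and adds a tr-based equivalent for environments that have tr but not sed. Both function names and the sample log line are made up for illustration.

```shell
# Removes every single quote from the text read on standard input,
# using the same sed expression as above.
strip_quotes() { sed "s/'//g"; }

# Equivalent using tr.
strip_quotes_tr() { tr -d "'"; }

# A header line in the --%h--%ad--%aN format with literal quotes
# (hash, date and author are invented).
printf '%s\n' "'--a1b2c3d--2021-08-15--Jane Doe'" | strip_quotes
# prints: --a1b2c3d--2021-08-15--Jane Doe
```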

CleanUpCodeMaat: This component modifies the results of Code Maat's execution for use with the DependencyFinderProcessing smell detector. It is only required when analyzing Java projects.

java -cp ARCADE_Core.jar logical_coupling.cleanUpCodeMaat <ARTIFACTS_DIR>

<ARTIFACTS_DIR> : The artifacts directory. Note that no filename should be included: CleanUpCodeMaat is set to execute over any .csv files found in the artifacts directory at the time of execution. Its output will be named project_clean.csv, where project stands for the name given to the result of Code Maat's execution.

PMD: A distribution of PMD is included in this release, along with a distribution of Apache Ant with which to execute it. This fact extractor will obtain necessary clone information to be used by the DependencyFinderProcessing smell detector.

apache-ant-1.9.6/bin/ant.bat -f pmd-bin-5.3.2/cpd.xml cpd -Din=<PATH_TO_SOURCE> -Dout=<ARTIFACTS_DIR>/<CLONES_DIR>/output_clone.xml

<PATH_TO_SOURCE> : This should be a path pointing to the root of the system under analysis, so that PMD may locate all source files by searching that directory's subtree.

<ARTIFACTS_DIR> : The artifacts directory. The suggested filename output_clone.xml may be changed. Note that, once again, this output does not need to be located in the artifacts directory, though we recommend it for tidiness.

<CLONES_DIR> : As DependencyFinderProcessing expects the results from multiple versions to be used as input, the results should be placed together in a subdirectory.

Dependency Finder: A distribution of Dependency Finder is included in this release. It is used to obtain additional dependency information required by the DependencyFinderProcessing smell detector.

DependencyFinder/bin/DependencyExtractor.bat -xml -out <ARTIFACTS_DIR>/<DEPENDENCIES_DIR>/output_deps.xml <PATH_TO_BINARIES>

<ARTIFACTS_DIR> : The artifacts directory. The suggested filename output_deps.xml may be changed. Note that, once again, this output does not need to be located in the artifacts directory, though we recommend it for tidiness.

<DEPENDENCIES_DIR> : As DependencyFinderProcessing expects the results from multiple versions to be used as input, the results should be placed together in a subdirectory.

<PATH_TO_BINARIES> : This should be a path pointing to a directory containing the compiled binaries of the system under analysis. Note that Dependency Finder is a Java dependency extraction tool, and therefore this component will only work with Java-based systems.

OdemToRsf: Converts a ClassDependencyAnalyzer (CDA) ODEM file to an RSF file so that it may be used by ARCADE. For copyright reasons, CDA is not distributed with ARCADE.

java -cp ARCADE_Core.jar edu.usc.softarch.arcade.facts.dependencies.OdemToRsf <PATH_TO_ODEM> <PATH_TO_RSF>

<PATH_TO_ODEM>: Path to the input ODEM file.

<PATH_TO_RSF>: Path to the output RSF file.

UnderstandCsvToRsf: Converts a dependencies CSV file generated by SciTools' Understand to an RSF file so that it may be used by ARCADE. This is the recommended tool for generating ARCADE's inputs, as the other supported tools (including JavaSourceToDepsBuilder and CSourceToDepsBuilder) can take excessive amounts of time for very large systems, and CDA cannot analyze C systems. SciTools' Understand is a proprietary, commercial tool, and we are not able to provide you with a license for it.

java -cp ARCADE_Core.jar edu.usc.softarch.arcade.facts.dependencies.UnderstandCsvToRsf <PATH_TO_CSV> <PATH_TO_RSF> <PROJECT_ROOT_NAME> <SOURCE_LANGUAGE>

<PATH_TO_CSV>: Path to the Understand CSV file.

<PATH_TO_RSF>: Path to the output RSF file.

<PROJECT_ROOT_NAME>: Name of the root directory where the system under analysis was located when Understand was executed. For example, if Understand was executed over C:\Systems\chromium-23.0.1271.97, then this argument should be chromium-23.0.1271.97.

<SOURCE_LANGUAGE>: Source language of the system under analysis, either c or java.
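
<PROJECT_ROOT_NAME> is the last component of the path that Understand was run over. As a minimal sketch, the hypothetical helper root_name below derives it from either a Windows or a Unix path.

```shell
# Hypothetical helper: prints the last component of a path, accepting
# both Windows (\) and Unix (/) separators.
root_name() {
  # normalize Windows separators to forward slashes
  path=$(printf '%s\n' "$1" | tr '\\' '/')
  path=${path%/}            # drop one trailing separator, if present
  printf '%s\n' "${path##*/}"
}

root_name 'C:\Systems\chromium-23.0.1271.97'   # prints chromium-23.0.1271.97
```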

Clustering

ACDC: This is the Algorithm for Comprehension-Driven Clustering, designed and developed by Tzerpos and Holt. The version distributed with ARCADE is modified to conform with modern Java practices, but is otherwise functionally equivalent.

java -cp ARCADE_Core.jar edu.usc.softarch.arcade.clustering.acdc.ACDC <PATH_TO_DEPS_RSF> <PATH_TO_CLUSTERS_OUTPUT>

<PATH_TO_DEPS_RSF> : Path to a dependencies.rsf file generated by a fact extractor, either CSourceToDepsBuilder or JavaSourceToDepsBuilder.

<PATH_TO_CLUSTERS_OUTPUT> : Path at which to place the output .rsf file. Note that required directories will be created if absent.

Clusterer: This is a shared entry point between ARC (Architecture Recovery with Concerns, Garcia et al.), WCA (Weighted Clustering Algorithm, Maqbool and Babri) and Limbo (scaLable InforMation BOttleneck, Andritsos and Tzerpos).

java -cp ARCADE_Core.jar edu.usc.softarch.arcade.clustering.Clusterer <CLUSTERING_ALGORITHM> <LANGUAGE> <RSF_PATH> <STOPPING_CRITERION> <STOPPING_CRITERION_VALUE> <SIMILARITY_MEASURE> <SERIALIZATION_CRITERION> <SERIALIZATION_THRESHOLD> <SUBJECT_SYSTEM_NAME> <SUBJECT_SYSTEM_VERSION> <OUTPUT_PATH> <PACKAGE_PREFIX> <ARTIFACTS_DIR> <REASSIGN_DOCTOPICS>

<CLUSTERING_ALGORITHM>: This is the desired clustering algorithm. Options are arc, limbo and wca.

<LANGUAGE>: The language of the subject system. Currently supported languages are java and c. Note that you must select the same language in both clustering and fact extraction.

<RSF_PATH>: The path to a dependencies RSF file generated by a fact extractor: CSourceToDepsBuilder, JavaSourceToDepsBuilder, OdemToRsf or UnderstandCsvToRsf. UnderstandCsvToRsf is preferable.

<STOPPING_CRITERION>: The criterion to be used to stop the clustering process. Two stopping criteria are currently available: preselected and archsizefraction. preselected establishes a fixed number of clusters at which to stop the clustering process, whereas archsizefraction continues until the number of clusters is the given fraction of the original number of entities.

<STOPPING_CRITERION_VALUE>: The value to be used by the stopping criterion. With preselected, this is the number of clusters at which to stop. With archsizefraction, this is a fraction between 0 and 1.
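
To get a feel for archsizefraction, the target number of clusters can be estimated by hand. The sketch below is illustrative arithmetic only: target_clusters is a hypothetical helper, and truncating the product to an integer is an assumption, as ARCADE_Core may round differently.

```shell
# Illustrative arithmetic only: number of clusters at which clustering
# would stop for a given entity count and fraction. Truncation to an
# integer is an assumption about rounding.
target_clusters() {
  awk -v entities="$1" -v fraction="$2" 'BEGIN { printf "%d\n", entities * fraction }'
}

target_clusters 1200 0.25   # prints 300
```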

<SIMILARITY_MEASURE>: This is the measure which will be used by the clustering algorithm. ARC currently only supports js, which applies the Jensen-Shannon divergence. Limbo is designed to use the il measure, which stands for Info Loss. WCA may use either uem or uemnm, which stand for Unbiased Ellenberg Measure and Unbiased Ellenberg Measure-NM.

<SERIALIZATION_CRITERION>: This is a criterion for how often to serialize the results of the clustering process. Currently supported are archsize, archsizemod, archsizefraction and stepcount. archsize functions the same way as the preselected stopping criterion (note that if archsize < stopping_criterion_value, the results will not be serialized). archsizemod provides a modulo to be applied to the architecture size, such that it will be serialized whenever architecture_size % value == 0. stepcount is the inverse of archsizemod, providing a value such that every n clustering steps, the result is serialized. Finally, archsizefraction functions the same way as the archsizefraction stopping criterion.

DISCLAIMER: All serialization criteria other than archsize are currently experimental, and may cause incorrect results! It is recommended that only archsize be used for this distribution.
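
For reference, the archsizemod rule can be illustrated as follows. This is a sketch of the modulo condition only, under the assumption that the architecture shrinks by one cluster per clustering step; archsizemod_points is a hypothetical helper and does not call ARCADE_Core.

```shell
# Illustrative only: lists the architecture sizes at which archsizemod
# would serialize, assuming the size shrinks by one cluster per step.
archsizemod_points() {
  size=$1 stop=$2 value=$3
  while [ "$size" -ge "$stop" ]; do
    if [ $((size % value)) -eq 0 ]; then
      echo "$size"
    fi
    size=$((size - 1))
  done
}

# From 10 clusters down to 1, with a value of 3:
archsizemod_points 10 1 3   # prints 9, 6 and 3 on separate lines
```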

<SERIALIZATION_THRESHOLD>: This is the value to be used by the serialization criterion. Each criterion will utilize this value differently, see above.

<SUBJECT_SYSTEM_NAME>: This is a name to be used by serialization to identify the output files.

<SUBJECT_SYSTEM_VERSION>: This is the version of the system under analysis.

<OUTPUT_PATH>: This is a path to a directory in which to place the output files.

<PACKAGE_PREFIX>: This is used for clustering Java systems, and works the same way as in JavaSourceToDepsBuilder. Providing a package prefix in this phase is highly encouraged. Providing an empty string ("") instead causes the selected clustering algorithm to execute over the entire system space, which for Java systems makes the results of Limbo and WCA entirely useless and those of ARC less than ideal. When analyzing C-based systems, this argument should always be the empty string "".

<ARTIFACTS_DIR>: This is the path to a directory containing the necessary auxiliary input artifacts to execute ARC. Note that this argument is optional for Limbo and WCA.

<REASSIGN_DOCTOPICS>: Boolean true or false, lower case. This argument speeds up ARC when it is run over multiple versions by re-using the partial results of the first execution in the later ones. Should be true for all executions except the first one.
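
Putting the fourteen arguments together for ARC over several versions might look like the sketch below. It is a dry run that only prints the commands. The helper arc_commands, the project name proj, the prefix org.example and every path are made-up examples, and the choice of 50 clusters is arbitrary.

```shell
# Hypothetical driver (dry run): prints one Clusterer command per version,
# passing false for <REASSIGN_DOCTOPICS> on the first version and true after.
# All paths, the project name and the prefix are made-up examples.
arc_commands() {
  reassign=false
  for version in "$@"; do
    echo java -cp ARCADE_Core.jar edu.usc.softarch.arcade.clustering.Clusterer \
      arc java "artifacts/proj-$version/deps.rsf" preselected 50 js archsize 50 \
      proj "$version" "output/proj-$version" org.example \
      "artifacts/proj-$version" "$reassign"
    reassign=true
  done
}

arc_commands 1.0 1.1
```

The archsize threshold is set equal to the preselected stopping value here, so that the final architecture is the one serialized.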

Smell Detection

ArchSmellDetector: This is the smell detector for dependency- and concern-based smells. It takes up to four arguments: two required inputs, one output path and one optional input.

java -cp ARCADE_Core.jar edu.usc.softarch.arcade.antipattern.detection.ArchSmellDetector <DEPS_FILE_PATH> <CLUSTERS_FILE_PATH> <OUTPUT_FILE_PATH> <DOC_TOPICS_PATH>

<DEPS_FILE_PATH>: This is a path to a deps.rsf file as generated by a Fact Extractor.

<CLUSTERS_FILE_PATH>: This is a path to a cluster.rsf file as generated by a Clustering Algorithm.

<OUTPUT_FILE_PATH>: This is the path at which to place the smell detection results.

<DOC_TOPICS_PATH>: This is an optional argument to a doc_topics.json file. This is an additional output file generated by ARC, and is therefore only required when running smell detection over the results of ARC.
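
The optional argument can be handled with a small wrapper. The following is a dry-run sketch: smell_command is a hypothetical helper that prints the command line and appends <DOC_TOPICS_PATH> only when one is given. The file names in the example are made up.

```shell
# Hypothetical wrapper (dry run): prints the ArchSmellDetector command,
# appending <DOC_TOPICS_PATH> only when a fourth argument is given.
smell_command() {
  cmd="java -cp ARCADE_Core.jar edu.usc.softarch.arcade.antipattern.detection.ArchSmellDetector $1 $2 $3"
  if [ -n "${4:-}" ]; then
    cmd="$cmd $4"
  fi
  echo "$cmd"
}

# ACDC, Limbo or WCA results: three arguments.
smell_command deps.rsf clusters.rsf smells_output
# ARC results: add the doc topics file.
smell_command deps.rsf clusters.rsf smells_output doc_topics.json
```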

DependencyFinderProcessing: This is the smell detector for interface- and change-based smells. It takes in six arguments, though they are different from those of ArchSmellDetector. Note that since this smell detector includes change-based smells, it should be run over the clusters files of multiple versions of a subject system. Also note that this component only works for Java-based systems, as described above in Known issues.

java -cp ARCADE_Core.jar edu.usc.softarch.arcade.antipattern.detection.interfacebased.DependencyFinderProcessing <CLUSTERS_DIRECTORY> <DEPENDENCIES_DIR> <CLONES_DIR> <CODE_MAAT_RESULTS> <PACKAGE_PREFIX> <OUTPUT_PATH>

<CLUSTERS_DIRECTORY>: This is a directory containing the cluster.rsf results of each version being analyzed.

<DEPENDENCIES_DIR>: This is a directory containing the outputs of running DependencyFinder on each version being analyzed.

<CLONES_DIR>: This is a directory containing the outputs of running PMD on each version being analyzed.

<CODE_MAAT_RESULTS>: This is the file containing the cleaned results from Code Maat, obtained from running the CleanUpCodeMaat fact extractor.

<PACKAGE_PREFIX>: This works the same way as in JavaSourceToDepsBuilder. Note that providing a package prefix in this phase is unnecessary if one has already been provided during the clustering phase.

<OUTPUT_PATH>: This is the path at which to place the smell detection results.
