causal-cmd v1.10.x
Introduction
Causal-cmd is a Java application that provides a Command-Line Interface (CLI) tool for causal discovery algorithms produced by the Center for Causal Discovery. The application currently includes the following algorithms:
boss, bpc, ccd, cpc, cstar, fas, fask, fask-pw, fci, fcimax, fges, fges-mb, fofc, ftfc, gfci, grasp, grasp-fci, ica-ling-d, ica-lingam, images, mgm, pag-sampling-rfci, pc, pc-mb, pcmax, r-skew, r3, rfci, skew, spfci, svar-fci, svar-gfci
Causal discovery algorithms are a class of search algorithms that explore a space of graphical causal models, i.e., graphical models where directed edges imply causation, for a model (or models) that fits a dataset well. We suggest that newcomers to the field review Causation, Prediction, and Search by Spirtes, Glymour, and Scheines for a primer on the subject.
Causal discovery algorithms allow a user to uncover the causal relationships between variables in a dataset. These discovered relationships can then be put to further use: understanding the underlying processes of a system (e.g., the metabolic pathways of an organism), generating hypotheses (e.g., variables that best explain an outcome), guiding experimentation (e.g., which gene knockout experiments should be performed), or prediction (e.g., parameterizing the causal graph using data and then using it as a classifier).
Command Line Usage
Java 8 or higher is the only prerequisite to run the software. Note that by default Java will allocate the smaller of 1/4 of system memory or 1 GB to the Java virtual machine (JVM). If you run out of heap memory while running your analyses, increase the memory allocated to the JVM with the switch -XmxXXG, where XX is the number of gigabytes of RAM the JVM may use. For example, to allocate 8 gigabytes of RAM, add -Xmx8G immediately after the java command.
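If you launch causal-cmd from a script, the placement of the heap switch is easy to get wrong. The following is a minimal Python sketch, not part of causal-cmd, that assembles the argument list with -Xmx placed immediately after java. The helper name java_command and its parameters are hypothetical and used only for illustration.

```python
def java_command(jar_path, heap_gb=None, app_args=()):
    """Build an argument list for running an executable jar.

    heap_gb, when given, becomes a -Xmx<N>G switch placed right after
    `java`, before -jar, so that the JVM (not the application) reads it.
    """
    cmd = ["java"]
    if heap_gb is not None:
        if heap_gb <= 0:
            raise ValueError("heap_gb must be a positive number of gigabytes")
        cmd.append(f"-Xmx{heap_gb}G")
    cmd += ["-jar", str(jar_path)]
    cmd += list(app_args)  # application switches go after the jar name
    return cmd


if __name__ == "__main__":
    # The jar name is the same placeholder used throughout this document.
    jar = "causal-cmd-<version number>-jar-with-dependencies.jar"
    print(" ".join(java_command(jar, heap_gb=8)))
```

The resulting list can be passed to subprocess.run once the placeholder is replaced with the real jar file name.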
In this example, we'll use the Retention.txt file, a dataset containing information on college graduation that was used in the publication "What Do College Ranking Data Tell Us About Student Retention?" by Druzdzel and Glymour, 1994.
Keep in mind that causal-cmd has different switches for different algorithms. To start, type the following command in your terminal:
java -jar causal-cmd-<version number>-jar-with-dependencies.jar
Note: causal-cmd-<version number>-jar-with-dependencies.jar stands for the executable jar of the specific version that you are using.
And you'll see the following instructions:
Missing required options: algorithm, data-type, dataset, delimiter
usage: java -jar Causal-cmd Project-1.10.0.jar --algorithm <string> [--comment-marker <string>] --data-type <string> --dataset <files> [--default] --delimiter <string> [--experimental] [--help] [--help-algo-desc] [--help-all] [--help-score-desc] [--help-test-desc] [--json-graph] [--metadata <file>] [--no-header] [--out <directory>] [--prefix <string>] [--quote-char <character>] [--skip-validation] [--version]
--algorithm <string> Algorithm: boss, bpc, ccd, cpc, cstar, dagma, direct-lingam, fas, fask, fask-pw, fci, fci-iod, fci-max, fges, fges-mb, fofc, ftfc, gfci, grasp, grasp-fci, ica-ling-d, ica-lingam, images, mgm, pag-sampling-rfci, pc, pc-mb, r-boss, r-skew, r3, rfci, skew, spfci, svar-fci, svar-gfci
--comment-marker <string> Comment marker.
--data-type <string> Data type: all, continuous, covariance, discrete, mixed
--dataset <files> Dataset. Multiple files are seperated by commas.
--default Use Tetrad default parameter values.
--delimiter <string> Delimiter: colon, comma, pipe, semicolon, space, tab, whitespace
--experimental Show experimental algorithms, tests, and scores.
--help Show help.
--help-algo-desc Show all the algorithms along with their descriptions.
--help-all Show all options and descriptions.
--help-score-desc Show all the scores along with their descriptions.
--help-test-desc Show all the independence tests along with their descriptions.
--json-graph Write out graph as json.
--metadata <file> Metadata file. Cannot apply to dataset without header.
--no-header Indicates tabular dataset has no header.
--out <directory> Output directory
--prefix <string> Replace the default output filename prefix in the format of <algorithm>_<numeric timestamp>.
--quote-char <character> Single character denotes quote.
--skip-validation Skip validation.
--version Show version.
Use --help for guidance list of options. Use --help-all to show all options.
When you specify an algorithm with the --algorithm switch, the program indicates the additional switches that the algorithm requires. In general, most algorithms also require data-type, dataset, delimiter and score. The switch --help-all displays an extended list of switches for the algorithm.
Example of listing all available options for an algorithm:
$ java -jar causal-cmd-1.10.0-jar-with-dependencies.jar --algorithm fges --data-type continuous --dataset Retention.txt --delimiter tab --score sem-bic-score --help
usage: java -jar Causal-cmd Project-1.10.0.jar --algorithm fges --data-type continuous --dataset Retention.txt --delimiter tab --score sem-bic-score [--addOriginalDataset] [--choose-dag-in-pattern] [--choose-mag-in-pag] [--comment-marker <string>] [--default] [--exclude-var <file>] [--experimental] [--external-graph <file>] [--extract-struct-model] [--faithfulnessAssumed] [--generate-complete-graph] [--genereate-pag-from-dag] [--genereate-pag-from-tsdag] [--genereate-pattern-from-dag] [--json-graph] [--knowledge <file>] [--make-all-edges-undirected] [--make-bidirected-undirected] [--make-undirected-bidirected] [--maxDegree <integer>] [--meekVerbose] [--metadata <file>] [--missing-marker <string>] [--no-header] [--numberResampling <integer>] [--out <directory>] [--parallelized] [--penaltyDiscount <double>] [--percentResampleSize <integer>] [--precomputeCovariances] [--prefix <string>] [--quote-char <character>] [--resamplingEnsemble <integer>] [--resamplingWithReplacement] [--saveBootstrapGraphs] [--seed <long>] [--semBicRule <integer>] [--semBicStructurePrior <double>] [--skip-validation] [--symmetricFirstStep] [--timeLag <integer>] [--verbose]
--addOriginalDataset Yes, if adding the original dataset as another bootstrapping
--choose-dag-in-pattern Choose DAG in Pattern graph.
--choose-mag-in-pag Choose MAG in PAG.
--comment-marker <string> Comment marker.
--default Use Tetrad default parameter values.
--exclude-var <file> Variables to be excluded from run.
--experimental Show experimental algorithms, tests, and scores.
--external-graph <file> External graph file.
--extract-struct-model Extract sturct model.
--faithfulnessAssumed Yes if (one edge) faithfulness should be assumed
--generate-complete-graph Generate complete graph.
--genereate-pag-from-dag Generate PAG from DAG.
--genereate-pag-from-tsdag Generate PAG from TsDAG.
--genereate-pattern-from-dag Generate pattern graph from PAG.
--json-graph Write out graph as json.
--knowledge <file> Prior knowledge file.
--make-all-edges-undirected Make all edges undirected.
--make-bidirected-undirected Make bidirected edges undirected.
--make-undirected-bidirected Make undirected edges bidirected.
--maxDegree <integer> The maximum degree of the graph (min = -1)
--meekVerbose Yes if verbose output for Meek rule applications should be printed or logged
--metadata <file> Metadata file. Cannot apply to dataset without header.
--missing-marker <string> Denotes missing value.
--no-header Indicates tabular dataset has no header.
--numberResampling <integer> The number of bootstraps/resampling iterations (min = 0)
--out <directory> Output directory
--parallelized Yes if the search should be parallelized
--penaltyDiscount <double> Penalty discount (min = 0.0)
--percentResampleSize <integer> The percentage of resample size (min = 10%)
--precomputeCovariances True if covariance matrix should be precomputed for tubular continuous data
--prefix <string> Replace the default output filename prefix in the format of <algorithm>_<numeric timestamp>.
--quote-char <character> Single character denotes quote.
--resamplingEnsemble <integer> Ensemble method: Preserved (1), Highest (2), Majority (3)
--resamplingWithReplacement Yes, if sampling with replacement (bootstrapping)
--saveBootstrapGraphs Yes if individual bootstrapping graphs should be saved
--seed <long> Seed for pseudorandom number generator (-1 = off)
--semBicRule <integer> Lambda: 1 = Chickering, 2 = Nandy
--semBicStructurePrior <double> Structure Prior for SEM BIC (default 0)
--skip-validation Skip validation.
--symmetricFirstStep Yes if the first step step for FGES should do scoring for both X->Y and Y->X
--timeLag <integer> A time lag for time series data, automatically applied (zero if none)
--verbose Yes if verbose output should be printed or logged
In this example, we'll run the FGES algorithm on the dataset Retention.txt.
$ java -jar causal-cmd-1.10.0-jar-with-dependencies.jar --algorithm fges --data-type continuous --dataset Retention.txt --delimiter tab --score sem-bic-score
By default, this command writes one file, fges_<unix timestamp>.txt, which contains the log and the result of the algorithm's run.
The --json-graph option additionally writes fges_<unix timestamp>_graph.json, a JSON representation of the graph that is equivalent to the JSON file exported from tetrad-gui.
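Because the default file name embeds a timestamp, a script that post-processes results has to discover the name first. The following is a minimal Python sketch, assuming that the timestamp portion consists only of digits (the help text calls it a "numeric timestamp") and that result files sit directly in the output directory. The helper latest_result is hypothetical and not part of causal-cmd.

```python
import re
from pathlib import Path

# Matches names such as fges_1696441365.txt; the _graph.json file is skipped
# because it does not end in <digits>.txt.
RESULT_NAME = re.compile(r"^(?P<algo>[a-z0-9-]+)_(?P<stamp>\d+)\.txt$")


def latest_result(out_dir, algorithm):
    """Return the <algorithm>_<timestamp>.txt with the largest timestamp, or None."""
    best = None
    for path in Path(out_dir).iterdir():
        match = RESULT_NAME.match(path.name)
        if match is None or match.group("algo") != algorithm:
            continue
        stamp = int(match.group("stamp"))
        if best is None or stamp > best[0]:
            best = (stamp, path)
    return best[1] if best else None
```

If you set a custom prefix with --prefix, the pattern above no longer applies and you can open the file by the name you chose.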
Example log output from causal-cmd:
================================================================================
FGES (Wed, October 04, 2023 01:42:43 PM)
================================================================================
Runtime Parameters
--------------------------------------------------------------------------------
number of threads: 7
Dataset
--------------------------------------------------------------------------------
file: Retention.txt
header: yes
delimiter: tab
quote char: none
missing marker: none
comment marker: none
Algorithm Run
--------------------------------------------------------------------------------
algorithm: FGES
score: Sem BIC Score
Algorithm Parameters
--------------------------------------------------------------------------------
addOriginalDataset: no
faithfulnessAssumed: no
maxDegree: 1000
meekVerbose: no
numberResampling: 0
parallelized: no
penaltyDiscount: 2.0
percentResampleSize: 100
precomputeCovariances: no
resamplingEnsemble: 1
resamplingWithReplacement: no
saveBootstrapGraphs: no
seed: -1
semBicRule: 1
semBicStructurePrior: 0.0
symmetricFirstStep: no
timeLag: 0
verbose: no
Wed, October 04, 2023 01:42:45 PM: Start data validation on file Retention.txt.
Wed, October 04, 2023 01:42:45 PM: End data validation on file Retention.txt.
There are 170 cases and 8 variables.
Wed, October 04, 2023 01:42:45 PM: Start reading in file Retention.txt.
Wed, October 04, 2023 01:42:45 PM: Finished reading in file Retention.txt.
Wed, October 04, 2023 01:42:45 PM: File Retention.txt contains 170 cases, 8 variables.
Start search: Wed, October 04, 2023 01:42:45 PM
End search: Wed, October 04, 2023 01:42:45 PM
================================================================================
Graph Nodes:
spending_per_stdt;grad_rate;stdt_clss_stndng;rjct_rate;tst_scores;stdt_accept_rate;stdt_tchr_ratio;fac_salary
Graph Edges:
1. spending_per_stdt --- fac_salary
2. spending_per_stdt --- rjct_rate
3. spending_per_stdt --- stdt_tchr_ratio
4. stdt_accept_rate --- fac_salary
5. stdt_clss_stndng --- rjct_rate
6. stdt_clss_stndng --- tst_scores
7. tst_scores --- fac_salary
8. tst_scores --- grad_rate
9. tst_scores --- rjct_rate
10. tst_scores --- spending_per_stdt
Graph Attributes:
Score: -5181.565079
Graph Node Attributes:
Score: [spending_per_stdt: -1408.4382541909688;grad_rate: -416.7933531919986;stdt_clss_stndng: -451.79480827547627;rjct_rate: -439.8087229322177;tst_scores: -330.2039598576225;stdt_accept_rate: -429.64771587695884;stdt_tchr_ratio: -208.85274641239832;fac_salary: -1496.025518245214]
Interpretation of graph output
The end of the file contains the causal graph edges from the search procedure. Here is a key to the edge types:
- A --- B - There is a causal relationship between variables A and B, but we cannot determine the direction of the relationship
- A --> B - There is a causal relationship from variable A to B
The GFCI algorithm has additional edge types:
- A <-> B - There is an unmeasured confounder of A and B
- A o-> B - Either A is a cause of B or there is an unmeasured confounder of A and B or both
- A o-o B - Either (1) A is a cause of B or B is a cause of A, or (2) there is an unmeasured confounder of A and B, or both 1 and 2 hold.
- A --> B dd nl - Definitely direct causal relationship and no latent confounder
- A --> B pd nl - Possibly direct and no latent confounder
- A --> B pd pl - Possibly direct and possibly latent confounder
Note: the generated result file name is based on the system clock.
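The edge list can also be read programmatically. The following is a minimal Python sketch, assuming that edge lines look like the ones in the sample log ("<n>. A <edge> B") and that any dd/nl/pd/pl markers follow the second variable on the same line. The function parse_edges is a hypothetical helper, not part of causal-cmd.

```python
import re

# One numbered edge line: "<n>. <A> <edge type> <B> [optional markers]"
EDGE_LINE = re.compile(
    r"^\s*\d+\.\s+(\S+)\s+(---|-->|<->|o->|o-o)\s+(\S+)(?:\s+(.*))?$"
)


def parse_edges(text):
    """Return (A, edge_type, B, markers) tuples from the 'Graph Edges:' block."""
    edges = []
    in_edges = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Graph Edges:":
            in_edges = True
            continue
        if not in_edges:
            continue
        match = EDGE_LINE.match(line)
        if match is None:
            if stripped:
                break  # first non-empty, non-edge line ends the block
            continue  # tolerate blank lines inside the block
        a, edge_type, b, rest = match.groups()
        markers = rest.split() if rest else []
        edges.append((a, edge_type, b, markers))
    return edges
```

For the sample log above, each entry would come back with the edge type "---" and an empty marker list, since FGES reported only undirected edges there.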
Sample Prior Knowledge File
From the above usage guide, we see the option --knowledge <file>, with which we can specify the prior knowledge file. Below is the content of a sample prior knowledge file:
/knowledge
addtemporal
1 spending_per_stdt fac_salary stdt_tchr_ratio
2 rjct_rate stdt_accept_rate
3 tst_scores stdt_clss_stndng
4* grad_rate
forbiddirect
x3 x4
requiredirect
x1 x2
The first line of the prior knowledge file must say /knowledge. A prior knowledge file consists of three sections:
- addtemporal - tiers of variables, where variables in an earlier tier precede those in later tiers. Adding an asterisk next to the tier id prohibits edges between variables within that tier
- forbiddirect - forbidden directed edges indicated by a list of pairs of variables: from -> to direction
- requiredirect - required directed edges indicated by a list of pairs of variables: from -> to direction
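When the tiers come from a script rather than being typed by hand, the file can be generated. The following is a minimal Python sketch that mirrors the layout of the sample file above, including the section keyword requiredirect as spelled in that sample. The function format_knowledge and its argument shapes are hypothetical and chosen only for illustration.

```python
def format_knowledge(tiers, forbidden=(), required=()):
    """Render prior knowledge in the layout of the sample file.

    tiers:     list of (variables, forbid_within) pairs, in temporal order;
               forbid_within=True adds the asterisk after the tier id.
    forbidden: iterable of (from_var, to_var) pairs for forbiddirect.
    required:  iterable of (from_var, to_var) pairs for requiredirect.
    """
    lines = ["/knowledge", "addtemporal"]
    for tier_id, (variables, forbid_within) in enumerate(tiers, start=1):
        marker = "*" if forbid_within else ""
        lines.append(f"{tier_id}{marker} " + " ".join(variables))
    lines.append("forbiddirect")
    lines += [f"{src} {dst}" for src, dst in forbidden]
    lines.append("requiredirect")
    lines += [f"{src} {dst}" for src, dst in required]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file and passing that file with --knowledge <file> would then follow the usage shown earlier.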