Creating make rules for dependencies between targets in the project's sub-directories
The source code tree (R) for my dissertation research software reflects the traditional research workflow: "collect data -> prepare data -> analyze data -> collect results -> publish results". I use make to establish and maintain the workflow (most of the project's sub-directories contain Makefile files).
However, I frequently need to execute individual parts of my workflow via particular Makefile targets in the project's sub-directories (not via the top-level Makefile). This creates the problem of setting up Makefile rules to maintain the dependencies between targets from different parts of the workflow, in other words, between targets in Makefile files located in different sub-directories.
The following represents the structure of my dissertation project:
+-- diss-floss (Project's root)
|-- import (data collection)
|-- cache (R data objects, representing different data sources, in sub-directories)
|-+ prepare (data cleaning, transformation, merging and sampling)
|-- R modules, including 'transform.R'
|-- analysis (data analyses, including exploratory data analysis (EDA))
|-- R modules, including 'eda.R'
|-+ results (results of the analyses, in sub-directories)
|-+ eda (*.svg, *.pdf, ...)
|-- ...
|-- present (auto-generated presentation for defense)
Snippets of targets from some of my Makefile files:
"~/diss-floss/Makefile" (almost full):
# Major variable definitions

PROJECT="diss-floss"
HOME_DIR="~/diss-floss"
REPORT={$(PROJECT)-slides}

COLLECTION_DIR=import
PREPARATION_DIR=prepare
ANALYSIS_DIR=analysis
RESULTS_DIR=results
PRESENTATION_DIR=present

RSCRIPT=Rscript

# Targets and rules

all: rprofile collection preparation analysis results presentation

rprofile:
	R CMD BATCH ./.Rprofile

collection:
	cd $(COLLECTION_DIR) && $(MAKE)

preparation: collection
	cd $(PREPARATION_DIR) && $(MAKE)

analysis: preparation
	cd $(ANALYSIS_DIR) && $(MAKE)

results: analysis
	cd $(RESULTS_DIR) && $(MAKE)

presentation: results
	cd $(PRESENTATION_DIR) && $(MAKE)

## Phony targets and rules (for commands that do not produce files)
#.html
.PHONY: demo clean

# run demo presentation slides
demo: presentation
	# knitr(Markdown) => HTML page
	# HTML5 presentation via RStudio/RPubs or Slidify
	# OR
	# Shiny app

# remove intermediate files
clean:
	rm -f tmp*.bz2 *.Rdata
"~/diss-floss/import/Makefile":
importFLOSSmole: getFLOSSmoleDataXML.R
	@$(RSCRIPT) $(R_OPTS) $<
...
"~/diss-floss/prepare/Makefile":
transform: transform.R
	$(RSCRIPT) $(R_OPTS) $<
...
"~/diss-floss/analysis/Makefile":
eda: eda.R
	@$(RSCRIPT) $(R_OPTS) $<
Currently, I am concerned about creating the following dependency: data collected by making a target from the Makefile in import always needs to be transformed by making the corresponding target from the Makefile in prepare before being analyzed via, for example, eda.R. If I manually run make in import and then, forgetting about the transformation, run make eda in analysis, things do not go too well. Therefore, my question is:
How could I use the features of the make utility (in the simplest way possible) to establish and maintain rules for dependencies between targets from Makefile files in different directories?
2 Answers
The problem with your use of make right now is that you are only listing the code as dependencies, not the data. That's where a lot of the magic happens. If the "analyze" step knew what files it was going to use and could list those as dependencies, it could look back to see how they were made and what dependencies they had. And if an earlier file in the pipeline was updated, it could run all the necessary steps to bring the file up to date. For example:
import: rawdata.csv
rawdata.csv:
	scp remoteserver:/rawdata.csv .

transform: transdata.csv
transdata.csv: gogo.pl rawdata.csv
	perl gogo.pl $< > $@

plot: plot.png
plot.png: plot.R transdata.csv
	Rscript plot.R
So if I do a make import, it will download a new csv file. Then if I run make plot, it will try to make plot.png, but that depends on transdata.csv, and that depends on rawdata.csv; and since rawdata.csv was updated, it will need to update transdata.csv, and then it will be ready to run the R script. If you don't explicitly set a lot of the file dependencies, you're missing out on a lot of the power of make. But to be fair, it can be tricky sometimes to get all the right dependencies in there (especially if you produce multiple outputs from one step).
answered May 29 '14 at 04:05
Thank you very much for the answer! You're right that as of now my use of make is basic (code only), but this is by design. I deliberately delayed creating dependencies based on data, because earlier I didn't have a clear idea about my workflow and the structure of my data components. Now that I have a clearer picture of the workflow and data, it's time to move on to more advanced use of make's power to automate the workflow for my research. Hence my question. (To be continued) - Aleksandr Blekh
Even if make has features to maintain dependencies between targets and multiple files (such as a whole directory), I'm still leaning toward switching back to using .RData files (one per data source) instead of a zillion .rds files (one per indicator). This will not only simplify the Makefile files, but, I hope, will also allow easier and more natural merging, sampling and visualization of the data as well as of the results of the data analysis. - Aleksandr Blekh
@AleksandrBlekh I think that does sound reasonable. And I've found that choosing good file names is very important. If the files at different steps in the pipeline differ just by a prefix, suffix or directory, then it becomes much easier to write elegant rules. make only has one useful form of wildcard/pattern matching and it's pretty limited. - MrFlick
I would appreciate it if you could share your feedback on my answer. - Aleksandr Blekh
The following are my thoughts (with some ideas from @MrFlick's answer - thank you) on adding my research workflow's data dependencies to the project's current make infrastructure (with snippets of code). I have also tried to reflect the desired workflow by specifying dependencies between make targets.
import/Makefile:
importFLOSSmole: getFLOSSmoleDataXML.R FLOSSmole.RData
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done
(similar targets for other data sources)
prepare/Makefile:
IMPORT_DIR=../import

prepare: import \
         transform \
         cleanup \
         merge \
         sample

import: $(IMPORT_DIR)/importFLOSSmole.done # and/or other flag files, as needed

transform: transform.R import
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done

cleanup: cleanup.R transform
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done

merge: merge.R cleanup
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done

sample: sample.R merge
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done
analysis/Makefile:
PREP_DIR=../prepare

analysis: prepare \
          eda \
          efa \
          cfa \
          sem

prepare: $(PREP_DIR)/transform.done # and/or other flag files, as needed

eda: eda.R prepare
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done

efa: efa.R eda
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done

cfa: cfa.R efa
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done

sem: sem.R cfa
	@$(RSCRIPT) $(R_OPTS) $<
	@touch $@.done
The contents of the Makefile files in the results and present directories are still TBD.
I would appreciate your thoughts and advice on the above!
answered May 29 '14 at 06:05
make checks the last modified date on the files to see if they need to be re-built. Since all of your targets are phony (i.e., they are not the actual names of files on disk), they will all be re-built every time. This is probably not the behavior you want. - MrFlick
But you are not listing the .done files as a dependency or as a target so make doesn't know about them. Using done files can be a good strategy, but they should be a part of the dependency chain to be useful in my opinion. - MrFlick
But I don't see a build rule for transform.done. I only see a phony target called transform. Make will not know how to build transform.done if it doesn't exist. Unless that's part of the code you are leaving out or you don't wish to automate the building of those dependencies. - MrFlick
This has become difficult to address via comments. Make does not look into build recipes to see what's actually made. Everything must be specified on the rule definition line. I suggest you make a small test case for yourself to try out different combinations. You can use make -n to see what make would run without actually doing the building. - MrFlick
I would look again at the answer that I originally posted to this question. That's the strategy I recommend. (You may replace file extensions with .done if you like.) - MrFlick
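A minimal sketch of that recommended strategy, with the .done stamps promoted to real targets and prerequisites (import.done here is an assumed stamp produced by the import step):

# Sketch only: .done stamps as real targets, so their timestamps drive rebuilds
transform.done: transform.R import.done
	@$(RSCRIPT) $(R_OPTS) transform.R
	@touch $@

eda.done: eda.R transform.done
	@$(RSCRIPT) $(R_OPTS) eda.R
	@touch $@

.PHONY: eda
eda: eda.done

With this arrangement, make eda re-runs transform.R and eda.R only when their inputs are newer than the corresponding stamps.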
What is the target in import/Makefile, and what does it actually produce? What is the target in prepare/Makefile, and what does it actually produce? When you make eda in analysis, what files does it use as input? - Beta
@Beta: Sorry about the delay - just got back online. Each target in import/Makefile, such as importFLOSSmole, produces a set of .rds files (I'm considering a change to produce a single .RData file) in a cache/<DataSourceName> sub-directory. Correspondingly, each target in prepare either updates existing R data files (targets transform and cleanup) or produces new R data files (targets merge and sample) in cache sub-directories. The eda target in analysis/Makefile depends on .rds files in cache sub-directories and produces .svg and .pdf files in the results/eda directory. - Aleksandr Blekh