We are developing a solution to replicate data from multiple Oracle sources to an S3 bucket. The data is then processed by Pentaho Data Integration (PDI) and written to a Snowflake database in the cloud.

We are concerned that dependent source data may not show up in the target DB in the right order because of the timing with which the CDC files are written.

For example, if we have ORDER and ORDER_DETAIL tables, and order #1 has order detail records 1A and 1B, it may be possible for the order details to be written to S3 before the parent order record is written, leaving orphaned records in the S3 target. I don't know how 'smart' Replicate can be in such situations, or whether it can guarantee that no orphaning occurs.
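
To make the concern concrete, here is a minimal Python sketch of the orphan condition I am worried about. The record layout, the field names (order_id, detail_id) and the find_orphans helper are my own assumptions for illustration only; they are not the actual CDC file format and not anything Replicate provides.

```python
# Hypothetical illustration: field names and record layout are assumptions,
# not the real format of the CDC files written to S3.

def find_orphans(orders, order_details):
    """Return detail records whose parent order_id is not among the orders seen so far."""
    known_order_ids = {o["order_id"] for o in orders}
    return [d for d in order_details if d["order_id"] not in known_order_ids]


# The detail file landed first; the parent order file has not been written yet.
orders_seen = []
details_seen = [
    {"order_id": 1, "detail_id": "1A"},
    {"order_id": 1, "detail_id": "1B"},
]
orphans_before = find_orphans(orders_seen, details_seen)

# Once the parent order arrives, the same details are no longer orphaned.
orders_seen.append({"order_id": 1})
orphans_after = find_orphans(orders_seen, details_seen)

print(len(orphans_before), len(orphans_after))
```

A check like this could in principle run downstream before loading, holding back details until their parent appears, but I would prefer to know whether the ordering can be guaranteed at the source.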

Can the scheduling or invocation of Replicate jobs be controlled by an outside system such as Pentaho, making PDI the orchestrator of data movement?