Load Ingest Data
Results 1 to 1 of 1

Thread: Continuous SQL-Based CDC - Continuous Query

  1. #1
    Join Date
    Sep 2006
    Posts
    233
    Rep Power
    10

    Continuous SQL-Based CDC - Continuous Query

    Noth: The feature described in this article is available only in AIS version 5.1
    Continuous CDC Explained

    There are several methods for consuming changes stored in the AIS staging area. Many ETL tools work by periodically ‘polling’ the change tables in the staging area for changes made since the last ETL run. This is done using an SQL statement such as the following one:

    Code:
    select * from INVENTORY_CHANGES where CONTEXT > ?
    Here, the parameter value is the last value returned in the previous ETL run as stored in some auxiliary table in a target database.

    This approach works well for polling cycles that are not too close (i.e., up to several times an hour). However, in cases where near real-time SQL-based change capture is required, this approach is not efficient. The cost of starting ETL runs frequently is high. This is more so if several change tables need to be monitored concurrently in near-real time.

    The Continuous CDC approach uses a special feature of the Attunity query processor that allows one to define a query that (almost) never ends – a Continuous Query. Internally, Attunity Connect does the polling by re-executing the query as needed but with the least possible overhead. For the ETL tool that uses a data access API such as ODBC, it just appears as if the query never ends. If no data is available, the ETL tool will just be blocked in a call to fetch the next row of data. When new data becomes available, the ETL will again start receiving rows back from the fetch call.

    The Continuous CDC approach requires one ETL run to be continuously active per consumed change table.

    Continuous Queries
    A continuous query is written using the following form:

    Code:
     
    select CONTEXT as $$StreamPosition, * from INVENTORY_CHANGES where CONTEXT > ?
    Here, too, the parameter value is the last value returned in the previous ETL run as stored in some auxiliary table in a target database.

    The only difference from the previous example is that we gave the CONTEXT column of the change table the alias name ‘$$StreamPosition (case insensitive). This special alias name instruct the query processor, once started the query with some initial value for the parameter, to read until the end of data and then to re-run the query with the last value of CONTEXT as the new parameter value.

    For a continuous query to work the following guidelines must be followed:
    1. The Stream Position column must be the first column on the SELECT list and with the alias name of ‘$$StreamPosition’.
    2. The matching Stream Position parameter must be the first parameter in the WHERE clause.


    Continuous queries can be invoked from ODBC, JDBC and ADO.NET interfaces (see comment about ADO at the end on the article).

    Stopping or Pausing a Continuous Query
    Running a query continuously is fine but there always comes a time when one will want to stop to continuous query. There may be all kinds of reasons for stopping a continuous query. For example, the system may need to go down for maintenance or maybe the ETL needs only active at some specific windows during the day.

    One simple way to programmatically stop a continuous query under ODBC is by calling SQLCancel() on a different thread. This, however, is ODBC-specific and programmatic – an approach is not always possible when dealing off-the-shelf tools.

    A far more generic way to stop a continuous query is to use yet another special column alias name '’$$ControlCommand’. This method allow one to pause or resume a continuous query as well as to stop it.

    When a Control Command column is used, the query processor checks its value and acts according to the value found. The following table describes the different Control Command column values:

    Code:
    +-------------------------+-----------------------------------------------------+
    | Value of the            |                                                     |
    | $$ControlCommand Column | Effect                                              |
    +-------------------------+-----------------------------------------------------+
    | (empty)                 | Query runs, returning rows as they become available |
    +-------------------------+-----------------------------------------------------+
    | Pause                   | Query is suspended, the calling application is      |
    |                         | blocked in the fetch call. After a retry interval,  |
    |                         | the query is re-executed and the value of the       |
    |                         | Control Command column is checked again.            |
    +-------------------------+-----------------------------------------------------+
    | Stop                    | The continuous query ends, returning end-of-data    |
    +-------------------------+-----------------------------------------------------+
    The following query demonstrates the use of a Control Command column:

    Code:
    select t.context AS $$StreamPosition, sp.command AS $$ControlCommand, t.*, sp.* from T t, STREAM_POSITIONS sp where
    sp.application_name = 'A' and
    sp.table_name = 'T' and
    t.context >= ?
    order by t.context
    limit to 1000 rows;
    Here, table T is the change table and STREAM_POSITIONS is the table used to record the last position processed in the previous ETL run. The STREAM_POSITIONS table contains a column named ‘command’ that gets the alias name $$ControlCommand. The STREAM_POSITION table must contain a row for the application and table before running this query. To stop this query while it is running continuously, one simply needs to run the following query from another session:

    Code:
    update STREAM_POSITIONS set command=’Stop’ where 
    sp.application_name = 'A' and
    sp.table_name = 'T'
    When writing a continuous query with a Control Command column, one must pay attention to the following points:

    1. The Control Command column must be on the second column in the SELECT list of the query, immediately following the Stream Position column.
    2. The condition on the Stream Position column must be >= (greater or equal) rather than > (greater). The reason is that when no rows are returned the Control Command column would also not be returned. Note that if the change table is empty, there would also be a problem to stop the continuous query!
    3. The ‘limit to 1000 rows’ clause was added so that if the query returns very long number of rows without re-executing, it may take it very long to re-evaluate the Control Command column. By setting a limit on the number of rows returned per query, the value Control Command column is re-evaluated relatively often. The limit of 1000 used here is arbitrary – one can use higher or lower values.


    The value of the Control Command column does not have to come from another table. It may be a calculated column that, for example set its value based on the time of day (for example, if may return ‘Pause’ value between 2:00PM and 4:00PM).

    When a continuous query is used with a thin-client (JDBC or ADO.NET), the polling happens on the server and so it may take a long time before response is returned to the client. The server keeps monitoring the client connection so if the client goes away, the server will automatically stop the continuous query at the next poll event.

    Continuous Query Settings

    There are three new Query Processor environment settings that affect the behavior of continuous queries:

    continuousQueryRetryInterval
    This parameter determines how long (in seconds) the query processor waits after one query execution returned no rows, before re-executing the query. The default value of this parameter is 2 seconds.

    continuousQueryTimeout
    This parameter determines how long (in seconds) the query processor will re-execute the query when no records are returned. The default value of this parameter is 3600 seconds (an hour) which means that, by default, after an hour without new messages, the continuous query will end. The maximum value of 2147483648 will be enough for around 68 years.

    continuousQueryPrefix
    This parameter can be used to change the ‘$$’ prefix used for identifying the continuous query special columns. The default of this parameter is ‘$$’. By setting this parameter to ‘##’, for example, the Stream Position alias name will be ‘##StreamPosition’ and the Control Command alias name will be ‘##ControlCommand’.

    Miscellaneous
    There working of a continuous query is subject to the normal query processing rules. One must ensure that the SQL used will return the result in the order of the Stream Position column – having a single-segment unique index on that column plus express sort condition on it is one way to ensure that (without performance penalty).

    Note Regarding ADO and Other Bulk Interfaces
    The continuous query feature is not compatible with the Microsoft ADO interface. The reason for that is that ADO tries to read several rows ahead. As a result, if only one row is available and ADO insists on getting 10 rows (or end-of-file indication) then ADO will be blocked on the read until 10 rows are actually availeable or otherwise until the continuous query ends.
    The same issue exists also with ODBC when using extended fetch rather than single row fetch.
    Last edited by DrorHarari; 12-09-2007 at 06:04 AM.
    By Dror Harari

    To Find Out more About Attunity Technology:
    Attunity
    or:
    Contact Us

Posting Permissions

  • You may not post new threads
  • You may not post replies
  • You may not post attachments
  • You may not edit your posts
  •