Split a File into Individual Records

Introduction

This design pattern can be used to split a single file containing multiple records into multiple files, each containing an individual record. This is done with the Set function, using a variable name that begins with the SCOPE_CHUNK prefix.

Examples applying this pattern are provided for chunking hierarchical JSON data and flat CSV data. You can also apply this pattern to XML data.

Use Case

In this scenario, the source data contains multiple records, and the business process requires records to be processed individually.

For example, if each record needs to be validated prior to being processed into the endpoint, you can break the flow into individual records to avoid the entire file being rejected.

Another reason you may need to do this is if the target accepts only one record at a time.

Design Pattern

This design pattern reads the file using an operation whose transformation uses the same source and target schema. The pattern has these key characteristics:

  • In the first field of the transformation processing the data, the Set function is used to set a variable whose name begins with SCOPE_CHUNK, followed by additional text to construct the variable name. The function's second argument builds the output filename, typically using a record identifier from the source data. Filenames must be unique to avoid files being overwritten, so a record counter or a GUID can be used as part of the filename. (A minimal script sketch follows this list.)
  • A file storage target such as Temporary Storage or Local Storage must be used. (Using a Variable endpoint is not supported.) Defining a path is recommended. The target's filename should be configured with the global variable name used in the Set function, enclosed in square brackets.
  • The operation options must be configured with Enable Chunking selected, with a Chunk Size of 1, Number of Records per File of 1, and Max Number of Threads of 1.
  • At runtime, each record is read, assigned a unique filename, and written individually to the target.
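
As an illustrative sketch only (the variable name SCOPE_CHUNK_filename, the source field reference recordId, and the counter logic are hypothetical, not required names), the first field's mapping script could look like this:

    <trans>
    // Count records so that each filename is unique; guard against the
    // counter being unset on the first record.
    $recordCount = If(IsNull($recordCount), 1, $recordCount + 1);
    // The variable name must begin with SCOPE_CHUNK; "recordId" stands in
    // for the actual source field reference.
    Set("SCOPE_CHUNK_filename", recordId + "_" + $recordCount + ".json");
    recordId
    </trans>

The target activity's filename would then be configured as [SCOPE_CHUNK_filename] so that each record is written to its own file.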

After using this pattern, the typical next step is to use the FileList function to get the array of filenames in the directory (configure the file directory Read activity with the * wildcard), then loop through the array and read each file as the source of the next operation. The logging scripts in the examples below follow this approach.

Examples

Examples using the design pattern described above are provided for two different types of source data: a hierarchical JSON file and a flat CSV file.

Hierarchical JSON

This example operation chain applies the design pattern described above to split hierarchical JSON data into a file for each record. Each numbered step below describes one part of the operation chain.

  1. The script assigns the source data to a global variable named io (input/output). (A script is used for demonstration purposes; the source data could also come from a configured endpoint.)
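
    A minimal stand-in for this script, with a hypothetical payload, might be:

    <trans>
    // Hypothetical sample data; any JSON with a repeating record node works.
    $io = '{"records":[{"id":"1001","name":"Alpha"},{"id":"1002","name":"Beta"}]}';
    </trans>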

  2. The io global variable is used to configure a Variable endpoint, and an associated Variable Read activity is used by the operation as the source of the transformation.

  3. Within the transformation, the source and target use the same schema, and all the fields are mapped.

    The top item node has a condition that generates the record count. The first data field in the transformation mapping is configured to use SCOPE_CHUNK: the Set function sets a variable whose name starts with SCOPE_CHUNK, with a value that concatenates the unique record ID from the source, a record counter, and a .json suffix.
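
    As a sketch (recordId and the variable names are stand-ins for the actual source field reference and names), the node condition and the first field's mapping might be:

    <trans>
    // Condition on the top item node: increment the record counter and
    // return true so that every record is processed.
    $recordCount = If(IsNull($recordCount), 1, $recordCount + 1);
    true
    </trans>

    <trans>
    // First data field: build the per-record output filename in a
    // SCOPE_CHUNK-prefixed variable, then map the field value through.
    Set("SCOPE_CHUNK_filename", recordId + "_" + $recordCount + ".json");
    recordId
    </trans>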

  4. The target is a Temporary Storage Write activity configured with a default path and the filename global variable defined earlier.

  5. The scripts log the output and are optional. The first script gets the list of files from the directory and loops over it, logging each filename and size, then passes the filename to a second operation whose script logs the file contents:

    log_file_list_json
    <trans>
    // Get the list of files written by the splitter operation
    arr = Array();
    arr = FileList("<TAG>activity:tempstorage/json_splitter/tempstorage_read/Read</TAG>");
    cnt = Length(arr);
    i = 0;
    While(i < cnt,
      filename = arr[i];
      file = ReadFile("<TAG>activity:tempstorage/json_splitter/tempstorage_read/Read</TAG>", filename);
      WriteToOperationLog("file: " + filename + " has a length of: " + Length(file));
      // Pass the filename to the logging operation via a global variable
      $gv_file_filter = filename;
      RunOperation("<TAG>operation:log_each_file_json</TAG>");
      i++
    );
    </trans>
    
    log_each_file_json
    <trans>
    // Log the contents of the file identified by the global variable
    WriteToOperationLog(ReadFile("<TAG>activity:tempstorage/json_splitter/tempstorage_read/Read</TAG>", $gv_file_filter));
    </trans>
    
  6. The operation options must be configured with Enable Chunking selected, with a Chunk Size of 1, Number of Records per File of 1, and Max Number of Threads of 1.

Running the operation results in individual JSON records, as shown in the operation log output.

Flat CSV

This example operation chain applies the design pattern described above to split flat CSV data into a file for each record. Each numbered step below describes one part of the operation chain. For additional details, refer to the Hierarchical JSON example above.

  1. The script assigns the source data to a global variable named io (input/output). (A script is used for demonstration purposes; the source data could also come from a configured endpoint.)
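
    As with the JSON example, a minimal stand-in script with a hypothetical payload might be:

    <trans>
    // Hypothetical sample CSV with a header row and CRLF line endings.
    $io = "id,name" + Chr(13) + Chr(10)
        + "1001,Alpha" + Chr(13) + Chr(10)
        + "1002,Beta";
    </trans>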

  2. The io global variable is used to configure a Variable endpoint, and an associated Variable Read activity is used by the operation as the source of the transformation.

  3. Within the transformation, the source and target use the same schema, and all the fields are mapped. The first data field in the transformation's mapping script is configured to use SCOPE_CHUNK: the Set function sets a variable whose name starts with SCOPE_CHUNK, with a value that concatenates the unique record ID from the source, a record counter, and a .csv suffix.
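
    The only difference from the sketch in the JSON example is the file suffix; for example (again with recordId as a stand-in for the actual source field reference):

    <trans>
    // First data field: same pattern as the JSON example, with a .csv suffix.
    Set("SCOPE_CHUNK_filename", recordId + "_" + $recordCount + ".csv");
    recordId
    </trans>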

  4. The target is a Temporary Storage Write activity configured with a default path and the filename global variable defined earlier.

  5. The scripts log the output and are optional. The first script gets the list of files from the directory and loops over it, logging each filename and size, then passes the filename to a second operation whose script logs the file contents. (See the scripts in the Hierarchical JSON example above.)

  6. The operation options must be configured with Enable Chunking selected, with a Chunk Size of 1, Number of Records per File of 1, and Max Number of Threads of 1.

Running the operation results in individual CSV records, as shown in the operation log output.