Data

The Solution Example

Given PSC Consulting's deep expertise in integration and interface processes, below is a description of the ETL process, with emphasis on interfaces to various systems.

Extract

The Extract step covers the extraction of data from the source system and makes it accessible for further processing. The main objective of the extract step is to retrieve all the required data from the source system using as few resources as possible. The extract step should be designed so that it does not negatively affect the source system in terms of performance, response time, or any kind of locking.

Update notification

If the source system can provide a notification that a record has changed and describe that change, this is the easiest way to get the data.
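
As a hedged illustration, the sketch below assumes a hypothetical source system that emits change notifications as simple events (record id, operation, changed fields). The event shape and handler names are assumptions for illustration, not any specific product's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ChangeEvent:
    """Hypothetical change notification emitted by a source system."""
    table: str
    record_id: str
    operation: str                 # "insert" | "update" | "delete"
    changed_fields: Dict[str, object] = field(default_factory=dict)


def handle_change(event: ChangeEvent, downstream: Callable[[ChangeEvent], None]) -> None:
    # Because the source tells us exactly what changed, the extract step
    # only needs to forward the event to the next ETL stage.
    downstream(event)


if __name__ == "__main__":
    received: List[ChangeEvent] = []
    handle_change(
        ChangeEvent(table="customer", record_id="42", operation="update",
                    changed_fields={"email": "new@example.com"}),
        downstream=received.append,
    )
    print(received)
```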

Incremental extract

Some systems may not be able to provide notification that an update has occurred, but they can identify which records have been modified and provide an extract of those records. During subsequent ETL steps, the system needs to identify these changes and propagate them downstream. Note that with a daily extract, we may not be able to handle deleted records properly.
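
A minimal sketch of an incremental extract, assuming the source table exposes a last-modified timestamp column (the table, columns and timestamps are illustrative). Rows deleted since the previous run simply disappear from the result, which is why deletions are hard to detect this way.

```python
import sqlite3


def incremental_extract(conn: sqlite3.Connection, last_run: str) -> list[tuple]:
    # Pull only the rows modified since the previous extract run.
    cur = conn.execute(
        "SELECT id, name, email, last_modified FROM customer WHERE last_modified > ?",
        (last_run,),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER, name TEXT, email TEXT, last_modified TEXT)")
    conn.execute("INSERT INTO customer VALUES (1, 'Ada', 'ada@example.com', '2024-01-02T10:00:00')")
    conn.execute("INSERT INTO customer VALUES (2, 'Bob', 'bob@example.com', '2023-12-01T09:00:00')")
    changed = incremental_extract(conn, last_run="2024-01-01T00:00:00")
    print(changed)   # only Ada's row, modified after the last run
```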

Full extract

Some systems cannot identify which data has changed at all, so a full extract is the only way to get the data out of the system. A full extract requires keeping a copy of the previous extract in the same format in order to identify changes. A full extract also handles deletions.
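
A sketch of change detection for a full extract, assuming each record has a primary key and that a copy of the previous extract has been kept as described above (the data and keys are illustrative). Comparing the new extract with the previous copy yields inserts, updates and, importantly, deletions.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


def diff_full_extracts(previous: Dict[str, Record],
                       current: Dict[str, Record]) -> Tuple[List[Record], List[Record], List[Record]]:
    """Compare two full extracts keyed by primary key."""
    inserted = [current[k] for k in current.keys() - previous.keys()]
    deleted = [previous[k] for k in previous.keys() - current.keys()]
    updated = [current[k] for k in current.keys() & previous.keys()
               if current[k] != previous[k]]
    return inserted, updated, deleted


if __name__ == "__main__":
    prev = {"1": {"name": "Ada"}, "2": {"name": "Bob"}}
    curr = {"1": {"name": "Ada Lovelace"}, "3": {"name": "Cleo"}}
    print(diff_full_extracts(prev, curr))
    # inserts: Cleo, updates: Ada -> Ada Lovelace, deletions: Bob
```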

Data Services

  • Data preprocessing
  • Data extraction from multiple systems
  • Data connectors to all major platforms
  • Data visualisation
  • Data preparation for modelling
  • Data de-identification

Clean

The cleaning step is one of the most important, as it ensures the quality of the data in the data warehouse. Cleaning should apply basic data unification rules such as the following (a minimal sketch of these rules follows the list):

  • Making identifiers uniform (for example, the sex categories Male/Female/Unknown, M/F/null and Man/Woman/Not Available are all translated to the standard Male/Female/Unknown)
  • Converting null values into a standardised Not Available/Not Provided value
  • Converting phone numbers and post codes to a standardised form
  • Validating other business-context-specific fields as required
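
The rules above can be expressed as a small set of mapping and normalisation functions. The sketch below is illustrative only; the category mappings and the phone-number convention are assumptions, not a fixed standard.

```python
import re

SEX_MAP = {
    "male": "Male", "m": "Male", "man": "Male",
    "female": "Female", "f": "Female", "woman": "Female",
}
NULL_LIKE = {"", "null", "n/a", "not available"}


def clean_sex(value) -> str:
    # Translate the many source encodings to the standard Male/Female/Unknown.
    if value is None or str(value).strip().lower() in NULL_LIKE:
        return "Unknown"
    return SEX_MAP.get(str(value).strip().lower(), "Unknown")


def clean_nullable(value) -> str:
    # Convert null-like values into a standardised placeholder.
    if value is None or str(value).strip().lower() in NULL_LIKE:
        return "Not Available"
    return str(value).strip()


def clean_phone(value) -> str:
    # Keep digits only and, as an assumed convention, prefix with "+".
    digits = re.sub(r"\D", "", str(value or ""))
    return f"+{digits}" if digits else "Not Available"


if __name__ == "__main__":
    print(clean_sex("M"), clean_sex("Woman"), clean_sex(None))
    print(clean_nullable("  "), clean_phone("(02) 9876 5432"))
```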

Transform

The transform step applies a set of rules to transform the data from the source to the target. This includes converting any measured data to the same dimension (i.e. a conformed dimension) using the same units so that the data can later be joined.

The transformation step also covers joining data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.
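
A minimal sketch of the transform step using pandas, assuming pandas is available; the table and column names, and the 10% GST derivation, are illustrative. It joins two sources, derives a calculated value, aggregates, and generates surrogate keys.

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 1, 2], "amount": [100.0, 50.0, 75.0]})
customers = pd.DataFrame({"customer_id": [1, 2], "region": ["NSW", "VIC"]})

# Join data from two sources on the shared (conformed) customer dimension.
joined = orders.merge(customers, on="customer_id", how="left")

# Derive a new calculated value (here, amount including an assumed 10% GST).
joined["amount_inc_gst"] = joined["amount"] * 1.10

# Generate an aggregate per region.
by_region = joined.groupby("region", as_index=False)["amount_inc_gst"].sum()

# Generate surrogate keys for the target dimension, independent of source keys.
by_region["region_sk"] = range(1, len(by_region) + 1)

print(by_region.sort_values("region_sk"))
```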

Load

During the load step, it is necessary to ensure that the load is performed correctly and uses as few resources as possible. The target of the load process is often a database.

To make the load process efficient, it is helpful to disable any constraints and indexes before the load and re-enable them only after the load completes. Referential integrity then needs to be maintained by the ETL tool to ensure consistency.
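
A sketch of this load pattern, using SQLite purely so the example is self-contained (a production load would target the actual warehouse database): drop the index before the bulk insert, recreate it afterwards, and leave constraint checking to the ETL code during the load.

```python
import sqlite3

rows = [(1, "Ada", "NSW"), (2, "Bob", "VIC"), (3, "Cleo", "NSW")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_sk INTEGER, name TEXT, region TEXT)")
conn.execute("CREATE INDEX idx_customer_region ON dim_customer (region)")

# Disable constraints and indexes before the load ...
conn.execute("PRAGMA foreign_keys = OFF")   # integrity maintained by the ETL tool
conn.execute("DROP INDEX idx_customer_region")

# ... perform the bulk load ...
with conn:
    conn.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", rows)

# ... and re-enable them only after the load completes.
conn.execute("CREATE INDEX idx_customer_region ON dim_customer (region)")
conn.execute("PRAGMA foreign_keys = ON")

print(conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone())
```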


Managing the Extract Transform Load Process for Government Divisions

As with every application, there is a possibility that the ETL process will fail. This can be caused by a missing extract from one of the systems, missing values in one of the reference tables, or simply a connection or power outage. It is therefore necessary to design the ETL process with fail-recovery in mind.
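
One way to keep fail-recovery in mind is to checkpoint each completed step so that a failed run can resume rather than start over. The sketch below is a generic illustration, not a specific tool's feature; the step names and checkpoint file are assumptions.

```python
import json
from pathlib import Path
from typing import Callable, Dict

CHECKPOINT = Path("etl_checkpoint.json")


def run_with_recovery(steps: Dict[str, Callable[[], None]]) -> None:
    # Load the list of steps completed by a previous, possibly failed, run.
    done = json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else []
    for name, step in steps.items():
        if name in done:
            continue                        # already completed earlier, skip it
        step()                              # may raise; a rerun resumes from here
        done.append(name)
        CHECKPOINT.write_text(json.dumps(done))
    CHECKPOINT.unlink(missing_ok=True)      # clean run: remove the checkpoint


if __name__ == "__main__":
    run_with_recovery({
        "extract": lambda: print("extracting"),
        "clean": lambda: print("cleaning"),
        "transform": lambda: print("transforming"),
        "load": lambda: print("loading"),
    })
```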

Given PSC's experience with the ETL process, products from AWS or other cloud vendors such as Azure can be utilised to implement the process effectively.

We provide a number of data-related services. The following outlines the thinking that goes into data management and labelling.

Quality

Data labeling quality applies across the entire labeled dataset. Do all of the labelers label consistently, whether you have 10, 20 or 200 of them? Low-quality data can lead to poor model outcomes, and this often only shows up once the trained model is used to predict. To create, validate and maintain high-performing machine learning models in production, you have to train them on reliable data.
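
Labeler consistency can be quantified with an inter-annotator agreement score. A minimal sketch, assuming scikit-learn is available and using illustrative labels, compares two labelers with Cohen's kappa alongside plain percent agreement.

```python
from sklearn.metrics import cohen_kappa_score

# Labels produced by two labelers for the same ten items (illustrative data).
labeler_a = ["cat", "dog", "dog", "cat", "cat", "dog", "cat", "dog", "cat", "dog"]
labeler_b = ["cat", "dog", "cat", "cat", "cat", "dog", "cat", "dog", "dog", "dog"]

agreement = sum(a == b for a, b in zip(labeler_a, labeler_b)) / len(labeler_a)
kappa = cohen_kappa_score(labeler_a, labeler_b)   # agreement corrected for chance

print(f"percent agreement: {agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```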

Accuracy

In data labeling, accuracy is measured by how close a label is to the ground truth, or how well the chosen features that produce the label reflect real-world examples.
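
As a simple illustration, accuracy against ground truth can be computed as the fraction of labels that match a trusted reference set (the labels below are illustrative).

```python
ground_truth = ["cat", "dog", "dog", "cat", "dog"]
labels = ["cat", "dog", "cat", "cat", "dog"]

# Fraction of labels that agree with the ground-truth reference.
accuracy = sum(l == g for l, g in zip(labels, ground_truth)) / len(ground_truth)
print(f"label accuracy vs ground truth: {accuracy:.0%}")   # 80%
```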

Ethics

At present, few safeguards exist to prevent model builders or users from deliberately, carelessly, or recklessly manipulating data to further their own ends. Perhaps more importantly, few people understand or appreciate that harm can be caused when builders or users fail to recognise the values and assumptions on which a model is based, or fail to take into account all the groups who would be affected by a model's results. Evaluation provides a setting for ethics and shows the need to continue developing a vocabulary for exploring ethical concerns. It will become increasingly important for model builders and users to have a clear and strong code of ethics to guide them in making the ethical decisions they will surely have to face.

How is quality measured in data labeling?

  • Gold Class – Measure labels against data already known to be correctly or incorrectly labeled.
  • Randomised sampling – Select a random sample of completed tasks for a project co-ordinator or domain expert to review for accuracy.
  • Consensus – Assign several people to the same task and compare their labels (see the sketch after this list).
  • Intersection over union – Compare human-validated labels with machine auto-annotations by measuring how much they overlap (also sketched below).
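
A brief sketch of two of these checks, using illustrative data: consensus by majority vote across several labelers, and intersection over union for comparing a human-drawn bounding box with an auto-annotated one.

```python
from collections import Counter


def consensus(labels: list[str]) -> str:
    # Several people label the same task; take the majority vote.
    return Counter(labels).most_common(1)[0][0]


def iou(box_a: tuple, box_b: tuple) -> float:
    # Boxes as (x1, y1, x2, y2); intersection area divided by union area.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0


if __name__ == "__main__":
    print(consensus(["dog", "dog", "cat"]))                   # majority says "dog"
    print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 3))      # overlap score ~0.143
```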