Data Validation

 

 

The Bright Data approach to high-quality data

Bright Data’s proactive approach to data validation ensures that any deviation from predefined standards is caught early, reducing the risk of data corruption or misuse.
By defining clear validation rules, we maintain a strong foundation for data quality that supports accurate analytics, confident decision-making, and compliance with industry standards.


What is data validation?

Data validation refers to the process of ensuring the accuracy and quality of data. Validating data confirms that the values entered into data objects conform to the constraints defined in the dataset schema. The validation process also ensures that these values follow the rules established for your application. Validating data before updating your application's database is good practice, as it reduces errors and the number of round trips between the application and the database.
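
As a minimal illustration (not Bright Data's actual pipeline), the Python sketch below checks a hypothetical product record against a few application rules before anything is sent to the database; the field names, allowed currencies, and rules are assumptions made for the example.

# Minimal sketch: validate a record against simple application rules
# before writing to the database, so invalid records never trigger a
# round trip. The "product" fields and rules are hypothetical.

def validate_product(record: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the record is valid."""
    errors = []
    if not isinstance(record.get("id"), str) or not record["id"]:
        errors.append("id must be a non-empty string")
    if not isinstance(record.get("price"), (int, float)) or record["price"] < 0:
        errors.append("price must be a non-negative number")
    if record.get("currency") not in {"USD", "EUR", "GBP"}:
        errors.append("currency must be a supported code")
    return errors

record = {"id": "sku-123", "price": 19.99, "currency": "USD"}
errors = validate_product(record)
if errors:
    print("rejected before reaching the database:", errors)
else:
    print("record is safe to insert")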


Why is it crucial to validate the data?

Data providers must maintain rigorous quality control measures and offer ongoing support for data-related issues so that businesses can trust the provider's validation processes and expertise.

  • Accuracy: Businesses must ensure the data they purchase is accurate and error-free, as inaccurate data can negatively impact decision-making, analysis, and overall performance.
  • Completeness: The dataset should be comprehensive and contain all the relevant information to address the business's specific requirements.
  • Consistency: To facilitate efficient integration and analysis, all data sources and records must follow uniform formats, naming conventions, and measurement units.
  • Timeliness: Up-to-date and relevant data is essential, as outdated or stale data may not provide the desired insights and can lead to wrong decisions.


How do we ensure high-quality data?

Our validation process consists of several stages, each focusing on a different aspect of data collection.

Stage #1 Accuracy: Schema Validation

The first step is to define each field's schema and expected output. Each collected record then goes through schema validation: Is the value the right data type? Is a mandatory field left empty?

During setup, we define the field schema and expected output:

  • Data type (e.g., string, numeric, bool, date)
  • Mandatory fields (e.g., ID)
  • Common fields (e.g., price, currency, star rating)
  • Custom field validation

The dataset is created only after the records have been validated against the defined schema and expected field output.

Example: For a field like is_active, which is expected to be boolean, validation checks whether the value is True or False. Validation fails if the value is 'Yes,' 'No,' or anything else.
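
To make the schema check concrete, here is a minimal Python sketch (not the actual Bright Data implementation) of validating records against a field schema that declares data types and mandatory fields; the schema and field names are assumed for illustration.

# Minimal sketch of schema validation, assuming a hypothetical schema:
# each field declares its expected type and whether it is mandatory.
# The is_active check mirrors the example above: only real booleans pass.

SCHEMA = {
    "id":        {"type": str,   "mandatory": True},
    "price":     {"type": float, "mandatory": False},
    "is_active": {"type": bool,  "mandatory": True},
}

def validate_record(record: dict, schema: dict) -> list[str]:
    errors = []
    for field, rules in schema.items():
        value = record.get(field)
        if value is None:
            if rules["mandatory"]:
                errors.append(f"{field}: mandatory field is missing or empty")
            continue
        # bool is a subclass of int in Python, so reject booleans explicitly
        # wherever a non-boolean type is expected
        if rules["type"] is not bool and isinstance(value, bool):
            errors.append(f"{field}: expected {rules['type'].__name__}, got bool")
        elif not isinstance(value, rules["type"]):
            errors.append(f"{field}: expected {rules['type'].__name__}, got {type(value).__name__}")
    return errors

# 'Yes' is not a boolean, so this record fails validation
print(validate_record({"id": "sku-123", "is_active": "Yes"}, SCHEMA))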

Stage #2 Completeness: Dataset Statistics

This stage evaluates the dataset's key statistical attributes to ensure data quality, completeness, and consistency.

  • Filling rate (%): Assesses each field's fill rate against the expected value (based on sample statistics). Fill rates must meet a minimum percentage.
  • Unique values (#): Ensures that key fields, such as the unique ID, meet the required validation criteria by comparing the number of unique values against the expected count. The dataset must contain a minimum percentage of unique values.
  • Dataset size / minimum records threshold (#): Reflects the expected number of records. A minimum of X records is required for the initial dataset, and fluctuation is checked to stay within +/- 10% (a sketch of these statistical checks follows this list).
  • Persistence Validation: Once a field is populated, it becomes mandatory and cannot be left empty in subsequent entries. This ensures data consistency and completeness. If an attempt is made to leave the field empty after initial data entry, an error is triggered, prompting the user to provide the necessary information or justify the omission.
  • Type Verification: Rigorously checks the data type of each entry against the designated field type, be it string, number, date, etc. This ensures data integrity and prevents potential mismatches or errors during data processing. When a mismatch is detected, the system flags it for correction before further processing.
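
To illustrate how such statistical checks can be expressed, the sketch below computes a fill rate, a uniqueness rate, and a record-count fluctuation check; the thresholds, field names, and sample records are hypothetical, not Bright Data's production values.

# Minimal sketch of the Stage 2 checks, assuming hypothetical thresholds:
# per-field fill rate, share of unique IDs, and record-count fluctuation
# within +/- 10% of the previous snapshot.

def fill_rate(records: list[dict], field: str) -> float:
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def unique_rate(records: list[dict], field: str) -> float:
    values = [r.get(field) for r in records if r.get(field) not in (None, "")]
    return len(set(values)) / len(records)

def size_within_tolerance(current: int, previous: int, tolerance: float = 0.10) -> bool:
    return abs(current - previous) / previous <= tolerance

records = [
    {"id": "a1", "price": 10.0},
    {"id": "a2", "price": None},
    {"id": "a3", "price": 12.5},
]

checks = {
    "price fill rate >= 60%": fill_rate(records, "price") >= 0.60,
    "id uniqueness >= 99%":   unique_rate(records, "id") >= 0.99,
    "size within +/- 10%":    size_within_tolerance(len(records), previous=3),
}
for name, passed in checks.items():
    print(name, "->", "pass" if passed else "fail")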

Having assessed the dataset's statistical properties in Stage 2, we move on to Stage 3: a process for updating and maintaining the dataset that ensures its continued relevance and accuracy over time.

Stage #3 Continuous Monitoring

  • The final validation stage maintains the dataset as website structures change and records are updated or added. This stage ensures the relevance and accuracy of the dataset over time.
  • Errors and outliers are identified by comparing newly collected data with previously collected data, as sketched below.
    Any validation failure is reported to us via an alert mechanism.
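
As an illustration of this kind of monitoring, the sketch below compares a new snapshot with a previous one, flags records whose price moved beyond an assumed tolerance, and triggers a placeholder alert; the price field, the 50% tolerance, and the alert function are hypothetical examples, not the actual mechanism.

# Minimal sketch of continuous monitoring: compare newly collected data
# with the previous snapshot, flag outliers, and alert on any failure.
# Records missing from the previous snapshot are treated as new and skipped.

def find_outliers(previous: dict[str, float], current: dict[str, float],
                  max_change: float = 0.50) -> list[str]:
    outliers = []
    for record_id, new_price in current.items():
        old_price = previous.get(record_id)
        if old_price and abs(new_price - old_price) / old_price > max_change:
            outliers.append(record_id)
    return outliers

def send_alert(message: str) -> None:
    # Placeholder for a real alert mechanism (e.g., email, chat, pager)
    print(f"ALERT: {message}")

previous_snapshot = {"sku-1": 100.0, "sku-2": 20.0}
current_snapshot = {"sku-1": 310.0, "sku-2": 21.0}  # sku-1 jumped 3x

outliers = find_outliers(previous_snapshot, current_snapshot)
if outliers:
    send_alert(f"price outliers detected in {len(outliers)} record(s): {outliers}")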

Data is great only if it is reliable

With Bright Data, rest assured that your datasets are of the highest quality and integrity, resulting in improved insights and better-informed decisions.