In a previous blog post, we discussed the types of data available in multiple industries and the need to secure that data. In this post, we will look at how de-identification can prevent data from being traced back to an individual.
De-identified data is data from which personally identifiable information has been removed. Such information includes first and last names, residential addresses, Social Security Numbers, and similar attributes. We must take adequate care that the data does not reveal information about the individual, either deliberately or inadvertently.
Table of Contents
- Need for de-identified data
- Application for de-identification
- Types of large healthcare data from multiple sources
- How de-identification is related to health & personal data protection
- De-identification techniques
- De-identification guidelines issued by research institutions to their researchers
Need for de-identified data
We will need to de-identify data for the following reasons.
- Organizations can mitigate the risks of privacy breaches for their clients. They can also reduce their liability in the event of a privacy breach.
- De-identification helps keep identifiable data out of the hands of people who are not authorized to see it.
- De-identification allows the data to be used for a variety of market or health research while protecting it to such an extent that no one can trace it to an individual.
Application for de-identification
Let us discuss the use of personal information in two industries: Education and Healthcare.
In education, data can be analyzed in multiple ways: how many students enrolled in a specific course, the subjects they chose, the marks they secured, and the areas of those courses they find difficult or comfortable, to name a few. This research helps educational agencies decide what kinds of classes society will need in the future.
In healthcare, data is generated, stored, and exchanged between various healthcare devices, data sources, and ecosystems. Bring Your Own Device (BYOD) policies have also made data sharing very easy. This ease of sharing makes data privacy a big concern.
Types of large healthcare data from multiple sources
Any healthcare organization has a range of information systems, medical devices, and external partner organizations, all of which exchange data with one another. These include hospital management systems, electronic health records, CRM applications, marketing automation systems, pharmacy management systems, Medicare claims, insurance applications, general practitioner applications, devices that capture patient information, and managed care applications, to name a few.
How de-identification is related to health and personal data protection, privacy, and security regulations, for example HIPAA (USA), GDPR (EU), or specific regulations in other countries
The US government issued the HIPAA Privacy Rule to safeguard individually identifiable health information. The rule does, however, allow a covered entity or a business associate to derive knowledge from the data available to them, provided that the resulting information is not traceable to any individual. To do so, they can follow the rule's de-identification standard and implementation specifications.
The Privacy Rule outlines two methods.
- Formal determination by a qualified expert
- Removal of specified identifiers of the individual.
The second method also requires that the covered entity has no actual knowledge that the remaining information could be used, alone or in combination with other information, to identify the individual.
We will now discuss the two techniques in de-identification according to the HIPAA privacy rule.
Expert Determination Method
Under this method, an expert with appropriate knowledge of and experience with generally accepted statistical and scientific principles for rendering information not individually identifiable applies those principles to the data. The expert determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, to identify an individual. The methods and results that justify the determination must be documented and retained by the covered entity, and made available to the regulator in case of an investigation or audit.
Safe Harbor Method
We have to look at two types of identifiers: direct and indirect. A few potential direct identifiers of patient information are name, geographical information, dates (except the year), telephone numbers, vehicle and device identifiers, email addresses, and Social Security Numbers. A few potential indirect identifiers are race, age, date of birth, salary, and educational qualifications. For ZIP codes, the first three digits can be kept if the area formed by all ZIP codes sharing those three digits contains more than 20,000 people. If not, the three digits should be replaced with 000. This coding helps ensure that the data cannot be traced to individuals.
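The ZIP code and date rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`name`, `zip`, `birth_date`, and so on) and the contents of `LOW_POPULATION_ZIP3` are hypothetical placeholders, and a real implementation would have to cover every identifier the Safe Harbor method lists and use current census data for the ZIP prefixes.

```python
from datetime import date

# Placeholder set of 3-digit ZIP prefixes covering 20,000 people or fewer.
# Assumption for illustration only; a real list comes from census data.
LOW_POPULATION_ZIP3 = {"036", "059", "102"}

# Hypothetical field names for some direct identifiers; not exhaustive.
DIRECT_IDENTIFIER_FIELDS = {"name", "phone", "email", "ssn"}


def generalize_zip(zip_code: str) -> str:
    """Keep the first three digits, or return 000 for low-population areas."""
    prefix = zip_code[:3]
    return "000" if prefix in LOW_POPULATION_ZIP3 else prefix


def safe_harbor_sketch(record: dict) -> dict:
    """Return a copy of the record with direct identifiers dropped and
    the ZIP code and birth date generalized."""
    cleaned = {k: v for k, v in record.items()
               if k not in DIRECT_IDENTIFIER_FIELDS}
    if "zip" in cleaned:
        cleaned["zip3"] = generalize_zip(cleaned.pop("zip"))
    if "birth_date" in cleaned:
        # Dates are reduced to the year only.
        cleaned["birth_year"] = cleaned.pop("birth_date").year
    return cleaned


patient = {
    "name": "Jane Doe",
    "ssn": "123-45-6789",
    "zip": "94107",
    "birth_date": date(1980, 5, 17),
    "diagnosis": "asthma",
}
print(safe_harbor_sketch(patient))
```

The function returns a new dictionary rather than modifying the input, so the original record stays intact inside the covered entity's own systems.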
De-identification guidelines issued by research institutions to their researchers
In a nutshell, de-identification is a critical process because the impact of data disclosure is very high. Let us look at one example: the guidelines of the National Institute of Standards and Technology (NIST). The NIST guidelines identify four types of media sanitization to use with different data security categories on various storage devices: Disposal, Clearing, Purging, and Destroying. Let us discuss each of the four types.
Disposal: the media is discarded.
Clearing: the data on the media is overwritten so that it can no longer be read.
Purging: the data is removed in a way that protects it even from laboratory-grade recovery attacks.
Destroying: the media is made unusable.
Now comes the next question: how do we implement a process for re-identification? A covered entity may assign a unique code, or some other means of record identification, to a set of de-identified information so that it can later re-identify the data itself. The Privacy Rule, which protects health information related to an individual as long as it is PHI, allows this on two conditions. First, the code must not be derived from, or contain any trace of, information about the individual. Second, the covered entity must not use or disclose the code for any other purpose and must not disclose the re-identification method.
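One way to picture such a code is a random token that replaces the original identifier, with the mapping back to the individual kept in a separate table. The sketch below is a minimal illustration, not a prescribed method: the function name `assign_reid_codes` and the field names `patient_id` and `record_code` are assumptions made for this example.

```python
import secrets


def assign_reid_codes(records, id_field="patient_id"):
    """Replace id_field with a random code; return (coded_records, key_table).

    The code is drawn from a random generator, so it is not derived from
    any information about the individual. key_table must be stored
    separately and never released together with the coded data.
    """
    key_table = {}  # code -> original identifier (kept private)
    code_for = {}   # original identifier -> code, so repeat visits match
    coded = []
    for rec in records:
        original = rec[id_field]
        if original not in code_for:
            code = secrets.token_hex(8)
            while code in key_table:  # regenerate on the rare collision
                code = secrets.token_hex(8)
            code_for[original] = code
            key_table[code] = original
        new_rec = {k: v for k, v in rec.items() if k != id_field}
        new_rec["record_code"] = code_for[original]
        coded.append(new_rec)
    return coded, key_table


visits = [
    {"patient_id": "P1", "diagnosis": "asthma"},
    {"patient_id": "P2", "diagnosis": "diabetes"},
    {"patient_id": "P1", "diagnosis": "flu"},
]
coded_visits, key_table = assign_reid_codes(visits)
print(coded_visits)  # safe to share; key_table stays with the covered entity
```

A random token is used here instead of a hash of the patient identifier because a hash is derived from information about the individual and could be reversed by anyone who can guess the input.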
The risk in re-identification is that the information may fall into the wrong hands and that individuals may be identified as a result. How can this happen? It can happen when the de-identification process is incomplete, so we have to be careful about the process itself: the direct identifiers and the indirect (quasi-) identifiers discussed above should be removed from the shared data.
Now, let us understand the types of re-identification attempts. They can be a deliberate attempt, an inadvertent attempt, or a data breach. When data is released to known recipients under contract, the possibility of a re-identification attempt should still be considered. In the case of a public data release, however, any means of re-identification should be avoided completely.
The process of matching de-identified data with publicly available data to trace it back to the individual it belongs to is called re-identification or de-anonymization. The risk of re-identification is reduced with GDPR-compliant pseudonymization. GDPR incorporates data protection by design, and its notion of pseudonymization covers both direct and indirect identifiers: the data must not be attributable to a specific person without access to additional information that is kept separately. A 2000 study estimated that more than 80% of the US population can be uniquely identified using a combination of their ZIP code, gender, and birth date. The risk of re-identification should therefore be reduced drastically by masking or generalizing such information so that it cannot be traced to the individual.
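A simple way to get a feel for this risk is to count how many records in a data set have a unique combination of indirect identifiers, since those records are the easiest to match against public data. The sketch below is a rough check under stated assumptions, not a full risk assessment: the function names and the sample field names (`zip`, `gender`, `birth_date`) are chosen only for this example.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination is unique."""
    if not records:
        return 0.0
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(records)


def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records sharing one combination."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values()) if counts else 0


sample = [
    {"zip": "94107", "gender": "F", "birth_date": "1980-05-17"},
    {"zip": "94107", "gender": "F", "birth_date": "1980-05-17"},
    {"zip": "94107", "gender": "M", "birth_date": "1975-01-02"},
    {"zip": "10001", "gender": "F", "birth_date": "1990-09-09"},
]
fields = ["zip", "gender", "birth_date"]
print(unique_fraction(sample, fields))  # share of records that stand alone
print(smallest_group(sample, fields))   # 1 means at least one unique record
```

Running the same check after generalizing the fields, for example keeping only the first three ZIP digits and the birth year, shows how much masking lowers the share of unique records.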
To conclude, we have to be very careful when dealing with the data of any individual in any research. Extreme care should be taken to ensure that the data cannot be traced back to any individual when we encode it. You must also ensure that your partners apply similar standards when exposing the data to the external world.