COMP 5704

Parallel Algorithms and Applications in Data Science

Project Title: Top-Down Specialization on Apache Spark™

Name: Macarious Abadeer

School of Computer Science
Carleton University, Ottawa, Canada


Project Outline

Privacy-Preserving Data Publishing (PPDP) is an active field of research concerned with de-identifying data so that it can be shared for secondary uses, such as analytics and health care research, while minimizing information loss.
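Balancing data utility and data privacy is a challenging problem, and this research project intends to analyze the implementation of the Top-Down Specialization anonymization algorithm on a Spark™ cluster. Top-Down Specialization is a technique in which attribute values start at the most general level and are iteratively specialized until any further specialization would violate k-anonymity, that is, until some record would no longer be indistinguishable from at least k − 1 other records on its quasi-identifying attributes.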
Top-Down Specialization is one of the methods recommended for anonymizing datasets with large k requirements. The larger the value of k, the more anonymous the dataset is, typically at the cost of greater information loss.
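
The idea described above can be illustrated with a minimal, single-machine sketch in plain Python. It is not the Spark implementation from the reference paper, and it makes several simplifying assumptions: there is only one quasi-identifying attribute, the taxonomy tree (TAXONOMY) and the sample values are hypothetical, and candidate specializations are tried in order rather than ranked by a score such as information gain. The function names are illustrative only.

```python
from collections import Counter

# Hypothetical taxonomy tree for a single quasi-identifier (parent -> children).
TAXONOMY = {
    "Any": ["Europe", "Asia"],
    "Europe": ["France", "Italy"],
    "Asia": ["Japan", "India"],
}


def leaves_under(node):
    """Return the set of leaf values covered by a taxonomy node."""
    children = TAXONOMY.get(node)
    if not children:
        return {node}
    covered = set()
    for child in children:
        covered |= leaves_under(child)
    return covered


def generalize(values, cut):
    """Map each raw value to the node of the current cut that covers it."""
    lookup = {leaf: node for node in cut for leaf in leaves_under(node)}
    return [lookup[value] for value in values]


def is_k_anonymous(generalized, k):
    """Check that every group of identical generalized values has size >= k."""
    return all(count >= k for count in Counter(generalized).values())


def top_down_specialize(values, k):
    """Start at the root and specialize while k-anonymity still holds.

    Assumption: candidates are tried in order; no scoring function is used.
    """
    cut = ["Any"]
    changed = True
    while changed:
        changed = False
        for node in cut:
            children = TAXONOMY.get(node)
            if not children:
                continue  # leaf node, cannot be specialized further
            candidate = [n for n in cut if n != node] + children
            if is_k_anonymous(generalize(values, candidate), k):
                cut = candidate  # accept the specialization and rescan
                changed = True
                break
    return sorted(cut)


# Hypothetical sample column of raw quasi-identifier values.
values = ["France", "France", "Italy", "Italy", "Japan", "Japan", "India"]
for k in (2, 3, 4):
    print(k, top_down_specialize(values, k))
```

On this sample, a larger k stops the specialization earlier, so the published values stay more general; this is the utility and privacy trade-off the project sets out to study.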

Startup Reference Paper(s)

  1. U. Sopaoglu and O. Abul. A top-down k-anonymization implementation for Apache Spark. In 2017 IEEE International Conference on Big Data (Big Data), pages 4513–4521, December 2017.

Deliverables

Relevant References

  1. Clash of the titans: MapReduce vs. Spark for large scale data analytics
  2. A Parallel Method for Scalable Anonymization of Transaction Data
  3. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression.
  4. L-diversity: Privacy beyond k-anonymity
  5. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity
  6. Scalable, Efficient Anonymization with INCOGNITO - Framework & Algorithm
  7. Bottom-up generalization: a data mining solution to privacy protection
  8. Big data anonymization with spark
  9. On the Complexity of Optimal K-Anonymity
  10. An Indexed Bottom-up Approach for Publishing Anonymized Data
  11. Experimenting sensitivity-based anonymization framework in apache spark
  12. Combining Top-Down and Bottom-Up: Scalable Sub-tree Anonymization over Big Data Using MapReduce on Cloud
  13. Top-down specialization for information and privacy preservation
  14. An Advanced Bottom up Generalization Approach for Big Data on Cloud
  15. Privacy preserving big data publishing: a scalable k-anonymization approach using MapReduce
  16. A Multi-level Clustering Approach for Anonymizing Large-Scale Physical Activity Data
  17. A Scalable Two-Phase Top-Down Specialization Approach for Data Anonymization Using MapReduce on Cloud
  18. Data Anonymization Using Map Reduce on Cloud based A Scalable Two-Phase Top-Down Specialization