Abstract | Multirelational classification algorithms aim to discover patterns across multiple interlinked tables in a relational database. In a complex database schema, however, it is difficult to identify all possible relationships between attributes, because a database often contains a very large number of attributes spread across interconnected tables with non-determinate (such as one-to-many) relationships. A set of seemingly harmless attributes across multiple tables may therefore be used to learn unwanted classification models that accurately predict confidential information, leading to data leaks. Furthermore, eliminating or distorting confidential attributes may be insufficient to prevent such disclosure, since their values may be inferred from prior insider knowledge. This paper proposes an approach to identify such "dangerous" attribute sets. For data publishing, our method generates a ranked list of subschemas that maintain predictive performance on the class attribute while limiting both the disclosure risk and the predictive accuracy of confidential attributes. We demonstrate the effectiveness of our method on several databases. |
---|