Using NLP and Machine Learning to Detect Data Privacy Violations

Silva, Paulo; Goncalves, Carolina; Godinho, Carolina; Antunes, Nuno; Curado, Marília

doi:10.1109/INFOCOMWKSHPS50562.2020.9162683

Please use this identifier to cite or link to this item: https://hdl.handle.net/10316/93821

DC Field	Value	Language
dc.contributor.author	Silva, Paulo	-
dc.contributor.author	Goncalves, Carolina	-
dc.contributor.author	Godinho, Carolina	-
dc.contributor.author	Antunes, Nuno	-
dc.contributor.author	Curado, Marília	-
dc.date.accessioned	2021-03-20T09:54:54Z	-
dc.date.available	2021-03-20T09:54:54Z	-
dc.date.issued	2020	-
dc.identifier.isbn	978-1-7281-8695-5	-
dc.identifier.issn	978-1-7281-8695-5 (eISSN)	-
dc.identifier.issn	978-1-7281-8696-2	-
dc.identifier.uri	https://hdl.handle.net/10316/93821	-
dc.description.abstract	Privacy concerns are constantly increasing in different sectors. Regulations such as the EU's General Data Protection Regulation (GDPR) are pressuring organizations to handle the individual's data with reinforced caution. As information systems deal with increasingly large amounts of personal data in essential services, there is a lack of mechanisms to help organizations in protecting the involved data subjects. In this paper, we propose and evaluate the use of Named Entity Recognition as a way to identify, monitor and validate Personally Identifiable Information. In our experiments, we used three of the most well-known Natural Language Processing tools (NLTK, Stanford CoreNLP, and spaCy). First, we assess the effectiveness of the tools with a generic dataset. Then, machine learning models are trained and evaluated with datasets built on data that contain personally identifiable information. The results show that models' performance was highly positive in accurately classifying both generic and more context-specific data. We observe the relationship between the datasets' training size and respective performance and estimate the appropriate size for model training within this context. Furthermore, we discuss how our proposal can effectively act as a Privacy Enhancing Technology as well as the potential risks and associated impacts.	-
dc.language.iso	eng	-
dc.publisher	IEEE	-
dc.relation	info:eu-repo/grantAgreement/EC/H2020/786713/EU/Protection and control of Secured Information by means of a privacy enhanced Dashboard	-
dc.rights	openAccess	-
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	-
dc.title	Using NLP and Machine Learning to Detect Data Privacy Violations	-
dc.type	article	-
degois.publication.firstPage	972	-
degois.publication.lastPage	977	-
degois.publication.location	Toronto	-
degois.publication.title	IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS)	-
dc.relation.publisherversion	https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9162683	-
dc.peerreviewed	yes	-
dc.identifier.doi	10.1109/INFOCOMWKSHPS50562.2020.9162683	-
dc.date.embargo	2021-12-31	*
uc.date.periodoEmbargo	730	-
item.fulltext	With full text	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
item.grantfulltext	open	-
item.languageiso639-1	en	-
item.openairetype	article	-
item.cerifentitytype	Publications	-
crisitem.author.researchunit	CISUC - Centre for Informatics and Systems of the University of Coimbra	-
crisitem.author.parentresearchunit	Faculty of Sciences and Technology	-
crisitem.author.orcid	0000-0001-6760-4675	-
Appears in Collections:	FCTUC Eng.Informática - Artigos em Revistas Internacionais

Files in This Item:

File	Description	Size	Format
WORKSHOP_ON_SECURITY_AND_PRIVACY_IN_BIG_DATA__Camera_Ready.pdf		1.14 MB	Adobe PDF
