When Two are Better Than One: Synthesizing Heavily Unbalanced Data

Ferreira, Francisco; Lourenço, Nuno; Cabral, Bruno; Fernandes, João Paulo

doi:10.1109/ACCESS.2021.3126656

Please use this identifier to cite or link to this item: https://hdl.handle.net/10316/101173

DC Field	Value	Language
dc.contributor.author	Ferreira, Francisco	-
dc.contributor.author	Lourenço, Nuno	-
dc.contributor.author	Cabral, Bruno	-
dc.contributor.author	Fernandes, João Paulo	-
dc.date.accessioned	2022-08-16T07:57:15Z	-
dc.date.available	2022-08-16T07:57:15Z	-
dc.date.issued	2021	-
dc.identifier.issn	2169-3536	pt
dc.identifier.uri	https://hdl.handle.net/10316/101173	-
dc.description.abstract	Nowadays, data is king and if treated and used properly it promises to give organizations a competitive edge over rivals by enabling them to develop and design Intelligent Systems to improve their services. However, they need to fully comply with not only ethical but also regulatory obligations, where, e.g., privacy (strictly) needs to be respected when using or sharing data, thus protecting both the interests of users and organizations. Fraud Detection systems are examples of such systems where Machine Learning algorithms leverage information to classify financial transactions as legitimate or illicit. The data used to create these solutions is usually highly structured and contains categorical and continuous features characterised by complex distributions. One of the main challenges of fraud detection is concerned with the scarcity of fraudulent instances which results in highly unbalanced datasets. Additionally, privacy is crucial, and it is usually forbidden, or not possible, to share the data of organizations and individuals for creating or improving models. In this paper we propose a framework for private data sharing based on synthetic data generation using Generative Adversarial Networks (GAN) that learns the specificities of financial transactions data and generates fictitious data that keeps the utility of the original datasets. Our proposal, called Duo-GAN, uses two GAN generators to handle the data imbalance problem, one generator for fraudulent instances and the other for legitimate instances. With this approach, we observed, at most, a 5% disparity in F1 scores between classifiers trained and tested with actual data and the ones trained with synthetic data and tested with actual data.	pt
dc.language.iso	eng	pt
dc.relation	FCT - UID/CEC/00326/2020	pt
dc.relation	European Social Fund, through the Regional Operational Program Centro 2020; and in part by the Carnegie Mellon University (CMU)|Portugal Project autonomiC plAtform for MachinE Learning using anOnymized daTa (CAMELOT) under Grant POCI-01-0247-FEDER-045915	pt
dc.rights	openAccess	pt
dc.rights.uri	http://creativecommons.org/licenses/by/4.0/	pt
dc.subject	Fraud detection	pt
dc.subject	generative adversarial networks	pt
dc.subject	privacy	pt
dc.subject	machine learning	pt
dc.subject	synthetic data generation	pt
dc.subject	tabular data	pt
dc.title	When Two are Better Than One: Synthesizing Heavily Unbalanced Data	pt
dc.type	article	-
degois.publication.firstPage	150459	pt
degois.publication.lastPage	150469	pt
degois.publication.title	IEEE Access	pt
dc.peerreviewed	yes	pt
dc.identifier.doi	10.1109/ACCESS.2021.3126656	pt
degois.publication.volume	9	pt
dc.date.embargo	2021-01-01	*
uc.date.periodoEmbargo	0	pt
item.fulltext	With full text	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
item.languageiso639-1	en	-
item.openairetype	article	-
item.cerifentitytype	Publications	-
item.grantfulltext	open	-
crisitem.author.researchunit	CISUC - Centre for Informatics and Systems of the University of Coimbra	-
crisitem.author.researchunit	CISUC - Centre for Informatics and Systems of the University of Coimbra	-
crisitem.author.parentresearchunit	Faculty of Sciences and Technology	-
crisitem.author.parentresearchunit	Faculty of Sciences and Technology	-
crisitem.author.orcid	0000-0001-6060-4971	-
crisitem.author.orcid	0000-0001-9699-1133	-
Appears in Collections:	I&D CISUC - Artigos em Revistas Internacionais

Files in This Item:

File	Description	Size	Format
When_Two_are_Better_Than_One_Synthesizing_Heavily_Unbalanced_Data.pdf		2.53 MB	Adobe PDF	View/Open

WEB OF SCIENCE™ Citations

1

checked on May 2, 2023

Page views

98

checked on Oct 16, 2024

Downloads

147

checked on Oct 16, 2024
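
The abstract describes fitting one generator per class (fraudulent and legitimate) and then comparing F1 scores of classifiers trained on real versus synthetic data, both tested on real data. The following is a minimal sketch of that workflow only, not the authors' implementation: the per-feature Gaussian sampler is a deliberately simple stand-in for a GAN generator, the nearest-centroid classifier is a hypothetical placeholder, and the toy two-feature dataset is invented for illustration.

```python
import random
from statistics import mean, stdev


def fit_gaussian_sampler(rows):
    """Stand-in for ONE generator (assumption: not a GAN).
    Models each feature as an independent Gaussian."""
    params = [(mean(col), stdev(col)) for col in zip(*rows)]

    def sample(n, rng):
        return [[rng.gauss(mu, sd) for mu, sd in params] for _ in range(n)]

    return sample


def synthesize_per_class(X, y, rng):
    """Fit a separate sampler for each class and draw as many
    synthetic rows per class as the real data contains."""
    X_syn, y_syn = [], []
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        sampler = fit_gaussian_sampler(rows)
        X_syn += sampler(len(rows), rng)
        y_syn += [label] * len(rows)
    return X_syn, y_syn


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (fraud) class: 2TP / (2TP + FP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def fit_centroids(X, y):
    """Hypothetical placeholder classifier: one mean vector per class."""
    return {
        label: [mean(col) for col in zip(*[x for x, t in zip(X, y) if t == label])]
        for label in set(y)
    }


def predict(centroids, X):
    """Assign each row to the class with the nearest centroid."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(centroids, key=lambda k: sq_dist(centroids[k], x)) for x in X]


def make_toy_data(n_legit, n_fraud, rng):
    """Invented data: legitimate rows near (0, 0), fraud rows near (4, 4)."""
    X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_legit)]
    X += [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(n_fraud)]
    return X, [0] * n_legit + [1] * n_fraud


rng = random.Random(0)
X_train, y_train = make_toy_data(500, 10, rng)  # heavily unbalanced
X_test, y_test = make_toy_data(500, 10, rng)    # held-out real data

X_syn, y_syn = synthesize_per_class(X_train, y_train, rng)

# Train on real vs. train on synthetic; both are tested on real data.
f1_real = f1_score(y_test, predict(fit_centroids(X_train, y_train), X_test))
f1_syn = f1_score(y_test, predict(fit_centroids(X_syn, y_syn), X_test))
disparity = abs(f1_real - f1_syn)
print(f"F1 real={f1_real:.3f}  F1 synthetic={f1_syn:.3f}  disparity={disparity:.3f}")
```

Splitting by class before fitting means the minority class gets its own model instead of being swamped by the majority class, which is the motivation the abstract gives for using two generators. The numbers printed by this toy run say nothing about the results reported in the paper.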
