Finding the Critical Feature Dimension of Big Datasets

Silva, José Miguel Parreira e

Please use this identifier to cite or link to this item: https://hdl.handle.net/10316/82847

DC Field	Value	Language
dc.contributor.advisor	Ribeiro, Bernardete Martins	-
dc.contributor.advisor	Teixeira, César Alexandre Domingues	-
dc.contributor.author	Silva, José Miguel Parreira e	-
dc.date.accessioned	2018-12-22T18:38:04Z	-
dc.date.available	2018-12-22T18:38:04Z	-
dc.date.issued	2017-07-14	-
dc.date.submitted	2019-01-22	-
dc.identifier.uri	https://hdl.handle.net/10316/82847	-
dc.description	Master's dissertation in Informatics Engineering (Mestrado em Engenharia Informática) presented to the Faculdade de Ciências e Tecnologia	-
dc.description.abstract	Big Data allied to the Internet of Things nowadays provides a powerful resource that various organizations are increasingly exploiting for applications ranging from decision support and predictive and prescriptive analytics to knowledge and intelligence discovery. In analytics and data mining processes, it is usually desirable to have as much data as possible, though it is often more important that the data is of high quality, which raises two of the most important problems when handling large datasets: sample selection and feature selection. This work addresses the sampling problem and presents a heuristic method to find the “critical sampling” of big datasets. The critical sampling size of a dataset is defined as the minimum number of examples required for a given data analytic task to achieve satisfactory performance. The problem is very important in data mining, since the size of a dataset directly relates to the cost of executing the data mining task. Since determining the optimal solution to the Critical Sampling Size problem is intractable, this dissertation tests a heuristic method in order to assess its capability to find practical solutions. Results show an apparent Critical Sampling Size for all the tested datasets that is rather smaller than their original sizes. Further, the proposed heuristic method shows promising utility, providing a practical way to find a useful critical sample for data mining tasks.	por
dc.description.abstract	Big Data allied to the Internet of Things nowadays provides a powerful resource that various organizations are increasingly exploiting for applications ranging from decision support and predictive and prescriptive analytics to knowledge and intelligence discovery. In analytics and data mining processes, it is usually desirable to have as much data as possible, though it is often more important that the data is of high quality, which raises two of the most important problems when handling large datasets: sample selection and feature selection. This work addresses the sampling problem and presents a heuristic method to find the “critical sampling” of big datasets. The critical sampling size of a dataset is defined as the minimum number of examples required for a given data analytic task to achieve satisfactory performance. The problem is very important in data mining, since the size of a dataset directly relates to the cost of executing the data mining task. Since determining the optimal solution to the Critical Sampling Size problem is intractable, this dissertation tests a heuristic method in order to assess its capability to find practical solutions. Results show an apparent Critical Sampling Size for all the tested datasets that is rather smaller than their original sizes. Further, the proposed heuristic method shows promising utility, providing a practical way to find a useful critical sample for data mining tasks.	eng
dc.language.iso	eng	-
dc.rights	openAccess	-
dc.rights.uri	http://creativecommons.org/licenses/by-nc-sa/4.0/	-
dc.subject	Big Data	por
dc.subject	Critical Sample	por
dc.subject	Data Mining	por
dc.subject	Big Data	eng
dc.subject	Critical Sample	eng
dc.subject	Data Mining	eng
dc.title	Finding the Critical Feature Dimension of Big Datasets	eng
dc.title.alternative	Procura do Tamanho Crítico de Amostragem de Grandes Conjuntos de Dados	por
dc.type	masterThesis	-
degois.publication.location	DEI-FCTUC	-
degois.publication.title	Finding the Critical Feature Dimension of Big Datasets	eng
dc.peerreviewed	yes	-
dc.identifier.tid	202124010	-
thesis.degree.discipline	Informática	-
thesis.degree.grantor	Universidade de Coimbra	-
thesis.degree.level	1	-
thesis.degree.name	Mestrado em Engenharia Informática	-
uc.degree.grantorUnit	Faculdade de Ciências e Tecnologia - Departamento de Engenharia Informática	-
uc.degree.grantorID	0500	-
uc.contributor.author	Silva, José Miguel Parreira e::0000-0003-0284-4429	-
uc.degree.classification	18	-
uc.degree.presidentejuri	Correia, António Dourado Pereira	-
uc.degree.elementojuri	Ribeiro, Bernardete Martins	-
uc.degree.elementojuri	Fonseca, Carlos Manuel Mira da	-
uc.contributor.advisor	Ribeiro, Bernardete Martins::0000-0002-9770-7672	-
uc.contributor.advisor	Teixeira, César Alexandre Domingues::0000-0001-9396-1211	-
uc.controloAutoridade	Yes	-
item.fulltext	With full text	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
item.languageiso639-1	en	-
item.openairetype	masterThesis	-
item.cerifentitytype	Publications	-
item.grantfulltext	open	-
crisitem.advisor.researchunit	CISUC - Centre for Informatics and Systems of the University of Coimbra	-
crisitem.advisor.researchunit	CISUC - Centre for Informatics and Systems of the University of Coimbra	-
crisitem.advisor.parentresearchunit	Faculty of Sciences and Technology	-
crisitem.advisor.parentresearchunit	Faculty of Sciences and Technology	-
crisitem.advisor.orcid	0000-0002-9770-7672	-
crisitem.advisor.orcid	0000-0001-9396-1211	-
Appears in Collections:	UC - Dissertações de Mestrado

Files in This Item:

File	Description	Size	Format
dissertation.pdf		2.09 MB	Adobe PDF


Page views: 578 (checked on Oct 16, 2024)

Downloads: 640 (checked on Oct 16, 2024)

This item is licensed under a Creative Commons License
