Good data set for Pre-processing


I am enrolled in an undergraduate course in Data Mining and have an assignment to code a data mining preprocessor. I am free to choose the programming language and the data set. Could anybody suggest a good data set to use? I have been going through the UCI Repository and have found many similar resources, but as a beginner I am not sure which data set would be a good choice. The preprocessor should handle the following:
Data cleaning
Missing Values
Errors
Outliers
Normalization
De-duplication
Data Reduction
Sampling Techniques
Dimensionality Reduction
What kind of properties should I consider when choosing the data set? Any specific data set you would suggest?
You answered your own question: choose data sets that have the properties you listed. The UCI repository categorizes its data sets, so you can pick any of them to start experimenting.
If I were you, I would proceed stepwise: get a feel for what each of these problems looks like and how it affects classifier performance, and choose popular data sets, since they are used as benchmarks in most research papers. Many of the items you listed are separate machine learning problems in their own right, each with a lot of ongoing research.
I would start with something like this:
Missing values: Iris, Voting, Heart Disease
Duplicates: the 921,810 song dataset (not from UCI, I think)
Normalization: any continuous-valued data set whose features have different ranges
Sampling techniques: Pima
Dimensionality reduction: Swiss Roll
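To get a feel for what a few of these steps do before picking a real data set, here is a minimal sketch in plain Python. The toy rows, the column meanings (age, income) and the function names are all made up for illustration; mean imputation and min-max scaling are just one simple choice for each step, not the only ones.

```python
from statistics import mean

# Hypothetical toy rows: (age, income); None marks a missing value.
rows = [
    (25, 40000.0),
    (32, None),
    (25, 40000.0),   # exact duplicate of the first row
    (47, 82000.0),
]

def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence and the order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

def impute_mean(rows):
    """Replace None in each column with the mean of that column's observed values."""
    cols = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in cols]
    return [tuple(m if v is None else v for v, m in zip(row, means)) for row in rows]

def min_max_normalize(rows):
    """Rescale every column to [0, 1]; a constant column maps to 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# De-duplicate first so repeated rows do not skew the column means.
cleaned = min_max_normalize(impute_mean(deduplicate(rows)))
```

The order matters here: de-duplicating before imputing keeps the repeated row from pulling the mean income down, and normalizing last means the imputed value is scaled along with the rest.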
Another good way to find data sets is to look at the relevant publications. For dimensionality reduction, look at the PCA and ISOMAP papers; for sampling, see the SMOTE paper. Check what kind of data they use in their experiments and proceed accordingly.
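As an illustration of the sampling side, here is a rough sketch of the core idea behind SMOTE: a synthetic minority point is placed somewhere on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified version for intuition only, not the full algorithm from the paper; the function name `smote_like`, its parameters and the toy points are my own assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Assumes at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of base among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class points with two features each.
minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (3.0, 3.0)]
new_points = smote_like(minority, n_new=5)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region the minority class already occupies, which is why an imbalanced data set such as Pima is a natural place to try it.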
