Good data set for Pre-processing
I am enrolled in an undergraduate course in Data Mining, and I have an assignment to code a data mining pre-processor. I am free to choose the programming language and the data set, and I was wondering if anybody could suggest a good data set to use. I have been going through the UCI Repository and have found many more such resources, but as a beginner I am not sure which data set would be a good choice.

The pre-processor should deal with the following:

Data cleaning: missing values, errors, outliers
Normalization
De-duplication
Data reduction: sampling techniques, dimensionality reduction

What kind of properties should I consider when choosing the data set? Is there any specific data set you would suggest?
You answered your own question: pick data sets that have the properties you listed. The UCI repository categorizes its data sets, so you can choose any of them to start playing with.

If I were you, I would proceed stepwise: get a feel for what each of these problems looks like and how it affects classifier performance, and pick popular data sets, since they are used as benchmarks in most research papers. Many of the items you listed are separate machine learning problems in their own right, with a lot of ongoing research. I would start with something like this:

Missing values: Iris, Voting, Heart disease (Iris is complete as distributed, so you would have to blank out some values yourself)
Duplicates: 921,810 song dataset (not from UCI, I think)
Normalization: any continuous-valued data set whose features have different ranges
Sampling techniques: Pima
Dimensionality reduction: Swiss Roll

Another good way to find data sets is to look at the relevant publications. For dimensionality reduction, for example, look at the papers on PCA, ISOMAP, etc.; for sampling, see the SMOTE paper. See what kind of data they use for their experiments and proceed accordingly.
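
To get a feel for what several of these steps do before committing to a data set, it can help to try them on a handful of made-up rows first. The following is only a rough sketch using the Python standard library: the toy `rows`, the helper names (`deduplicate`, `impute_mean`, `iqr_outliers`, `min_max_scale`), mean imputation, and the 1.5 × IQR outlier rule are all assumptions chosen for illustration, not something the assignment prescribes. Dimensionality reduction is left out because it really needs a numeric library.

```python
import random
import statistics

# Hypothetical toy records (feature_a, feature_b); None marks a missing value.
rows = [
    (1.0, 200.0),
    (2.0, None),
    (2.0, None),      # exact duplicate of the previous row
    (3.0, 400.0),
    (4.0, 300.0),
    (100.0, 250.0),   # suspiciously large feature_a
]

def deduplicate(records):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    return list(dict.fromkeys(records))

def impute_mean(records):
    """Replace None in each column with the mean of that column's observed values."""
    columns = list(zip(*records))
    means = [statistics.mean(v for v in col if v is not None) for col in columns]
    return [tuple(m if v is None else v for v, m in zip(row, means))
            for row in records]

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return [v for v in values if v < q1 - k * spread or v > q3 + k * spread]

def min_max_scale(records):
    """Rescale every column to the range [0, 1]."""
    scaled_columns = []
    for col in zip(*records):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0   # avoid dividing by zero on constant columns
        scaled_columns.append([(v - lo) / span for v in col])
    return [tuple(row) for row in zip(*scaled_columns)]

unique = deduplicate(rows)                            # de-duplication
complete = impute_mean(unique)                        # missing values
flagged = iqr_outliers([row[0] for row in complete])  # outliers in feature_a
scaled = min_max_scale(complete)                      # normalization
sample = random.Random(0).sample(scaled, k=3)         # simple random sampling

print("outliers in feature_a:", flagged)
print("scaled rows:", scaled)
print("random sample:", sample)
```

Whatever data set you pick, it is worth checking that it actually exercises each of these functions: at least one column with gaps, some repeated rows, and features on clearly different scales.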