Good data set for Pre-processing
I am enrolled in an undergraduate course in Data Mining and have an assignment to code a data mining pre-processor. I am free to choose the programming language and the data set, and I was wondering if anybody could suggest a good data set to use. I have been going through the UCI Repository and have found many other such resources, but as a beginner I am not sure which data set would be a good choice.

The pre-processor should deal with the following:

- Data cleaning
  - Missing values
  - Errors
  - Outliers
- Normalization
- De-duplication
- Data reduction
  - Sampling techniques
  - Dimensionality reduction

What kind of properties should I consider when choosing the data set? Is there any specific data set you would suggest?
You answered your own question: choose data sets that have the properties you listed. The UCI repository categorizes its data sets, so you can pick any of them to start experimenting with.

If I were you, I would proceed step by step: get a feel for what each of these problems looks like and how it affects classifier performance. I would also pick some of the popular data sets, since they are used as benchmarks in most research papers. Most of the tasks you listed are separate machine learning problems in their own right, each with a lot of ongoing research.

I would start with something like this:

- Missing values: Iris, Voting, Heart Disease
- Duplicates: the 921,810 song dataset (not from UCI, I think)
- Normalization: any continuous-valued data set whose features have different ranges
- Sampling techniques: Pima
- Dimensionality reduction: Swiss Roll

Another good way to find data sets is to look at the relevant publications. For dimensionality reduction, for example, you can look at the papers on PCA, ISOMAP, etc.; for sampling, see the SMOTE paper. Check what kind of data they use in their experiments and proceed accordingly.
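To get a feel for what each step does before committing to a data set, it may help to prototype the steps on a few hand-made rows first. The following is only a minimal sketch in Python using the standard library; the function names (`deduplicate`, `impute_mean`, `iqr_outliers`, `min_max_normalize`, `sample_rows`) are hypothetical, rows are assumed to be tuples with `None` marking a missing value, and the mean imputation and 1.5 x IQR outlier rule are just common default choices, not the only options. Dimensionality reduction is left out because it usually needs a numerical library.

```python
import random
import statistics


def deduplicate(rows):
    """Drop exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def impute_mean(rows, col):
    """Replace None in column `col` with the mean of the observed values."""
    observed = [row[col] for row in rows if row[col] is not None]
    mean = statistics.fmean(observed)
    return [
        row[:col] + (mean if row[col] is None else row[col],) + row[col + 1:]
        for row in rows
    ]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no range information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def sample_rows(rows, k, seed=0):
    """Draw k rows uniformly at random without replacement (seeded)."""
    return random.Random(seed).sample(rows, k)


# Made-up toy rows: one exact duplicate and one missing value.
rows = [(5.1, 3.5), (4.9, None), (5.1, 3.5), (6.0, 3.0)]
cleaned = impute_mean(deduplicate(rows), col=1)
scaled_first_column = min_max_normalize([row[0] for row in cleaned])
```

Once each step behaves as expected on toy rows like these, the same functions can be pointed at a real data set that exhibits the corresponding problem.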