I’m trying to come up with some interesting large data sets for use in extra-credit work using predictive analytics, especially something that has interactions and/or a need for dummy variables. Can anyone out there point me in the right direction?

  1. Hi Dawn! I have a few data sets that I use for my group projects available here:


    The one most likely to be useful for dummy variable purposes is the PetHealth website data. It is made up, but generated with fairly sophisticated methods (a Weibull distribution to randomize site visit lengths, for example), and I made it messy (added pipes and NaNs) to give students some experience with cleaning data. It’s pretty fast to clean if you know what you’re doing. Feel free to modify, use, and share at will.


    1. Hi Jason, I first want to apologize for my slow response. I’ve been having problems with my spam filter and somehow your post ended up in a big queue. Anyway, thanks for the data. I’ll check it out soon.
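
For anyone following the thread, here is a rough sketch of the kind of exercise described above: Weibull-distributed visit lengths, some injected mess, a cleaning pass, and dummy variables for a categorical column. It is not the PetHealth data or its actual layout. The `visit_minutes|pet_type` row format, the pet types, the Weibull parameters, and the helper names `clean` and `dummies` are all assumptions made up for illustration, and it uses only the Python standard library.

```python
import math
import random

random.seed(0)  # reproducible illustration

# Hypothetical raw rows: "visit_minutes|pet_type", with pipes and NaN gaps.
pets = ["dog", "cat", "bird"]
raw = []
for i in range(8):
    minutes = random.weibullvariate(5.0, 1.5)  # scale 5, shape 1.5 (assumed)
    field = "NaN" if i % 4 == 3 else f"{minutes:.2f}"  # every 4th value missing
    raw.append(f"{field}|{pets[i % 3]}")

def clean(rows):
    """Split each row on the pipe and drop rows with a missing visit length."""
    out = []
    for row in rows:
        minutes_text, pet = row.split("|")
        minutes = float(minutes_text)  # float("NaN") parses to nan
        if not math.isnan(minutes):
            out.append((minutes, pet))
    return out

def dummies(rows, baseline="dog"):
    """Encode pet type as 0/1 columns, omitting the baseline level."""
    levels = sorted({pet for _, pet in rows} - {baseline})
    encoded = [(m, *[int(pet == lvl) for lvl in levels]) for m, pet in rows]
    return encoded, levels

cleaned = clean(raw)
encoded, levels = dummies(cleaned)
```

Leaving out one baseline level (here `dog`) is the usual way to avoid perfect collinearity with the intercept in a regression; with three pet types you end up with two dummy columns.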

