Kaggle is a platform that hosts competitions in which data scientists try to produce the most accurate predictions for real-world problems. Many of the competitions use existing data to predict unknown outcomes. Random Forest is reportedly used in a lot of these competitions to predict unseen outcomes (source needed). We’ll use Random Forest to predict the rank at which a set of words will show up in the Google Play Store.
1) Switch to the Virtualenv that was Set Up Previously
I’m going to be running everything in a virtualenv so my packages are well isolated. This complicates things a bit in the setup, but should make it cleaner for distribution later.
First, open up a DOS Command Prompt. Next, I’m going to change to the directory that has my virtualenv already set up from Term Frequency-Inverse Document Frequency Extraction of Data Stored in MongoDB.
cd C:\Documents and Settings\Chris\workspace\PlayPyNLP
Next I’ll activate the virtualenv.
Scripts\activate
Now my command line should show
(PlayPyNLP) C:\Documents and Settings\Chris\workspace\PlayPyNLP>
2) Install Python’s SciPy
SciPy is Python’s main library for math, science, and engineering. It is also a prerequisite for installing the scikit-learn Python machine learning library.
If you haven’t already done so, you must install numpy since this is a prerequisite for scipy.
(PlayPyNLP) CommandPrompt> pip install numpy
Next, download and install the executables for scipy. I tried to use pip install for this task initially, but I got an error about a missing Blas library (see more details in the “Things That Go Wrong” section).
Go to http://sourceforge.net/projects/scipy/files/scipy and download the latest and greatest. At the time of this writing the latest download link is http://sourceforge.net/projects/scipy/files/latest/download?source=files
Since we are using a virtualenv, it gets a little tricky. The downloaded executable is just a thin wrapper around 3 possible executables. Therefore you’ll have to unzip it and copy one of them out to install directly. Go to the folder containing the downloaded .exe file, in my case scipy-0.11.0-win32-superpack-python2.7. Now, right-click the file and go
7-Zip -> Extract Files
This will dump 4 files into your specified directory. The one you want is sse3, where sse is a level of optimization for the library based on your computer’s hardware. Now, go back to the Command Prompt where the virtualenv is activated and navigate to the extracted directory. Use easy_install to install this file. Note that I tried like a banshee to get ‘pip install’ to install SciPy, but in the end it was a lot less work to use easy_install.
(PlayPyNLP) CommandPrompt> easy_install scipy-0.11.0-sse3.exe
Hopefully the messaging will report a successful installation. You can check this out by trying to import the library.
(PlayPyNLP) CommandPrompt> python
(PlayPyNLP) PythonInterpreter> import scipy
If no error is reported this means it’s working.
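The same check can be scripted, which is handy when more than one package needs verifying. This is only a minimal sketch using the standard library. The helper name `installed_version` is made up for illustration, and it assumes the package exposes a `__version__` attribute (numpy and scipy do). Run it with the virtualenv activated, since it only reports what the current interpreter can see.

```python
import importlib


def installed_version(package_name):
    """Return the package's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None
    # Not every package defines __version__, so fall back to a placeholder.
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    # The prerequisites from this step; add more names as needed.
    for name in ("numpy", "scipy"):
        version = installed_version(name)
        if version is None:
            print("%s: NOT importable from this interpreter" % name)
        else:
            print("%s: version %s" % (name, version))
```

A result of None for a package that was definitely installed usually means it went into a different Python than the one the virtualenv is using.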
3) Install Python’s Machine Learning Library: scikit-learn
Next, install scikit-learn with pip install. Note that if you are not using a virtualenv you can just download and run the Windows executable instead.
(PlayPyNLP) CommandPrompt> pip install scikit-learn
Hopefully the messaging will report a successful installation. You can check this out by trying to import the library.
(PlayPyNLP) CommandPrompt> python
(PlayPyNLP) PythonInterpreter> import sklearn
If no error is reported this means it’s working.
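A bare import only proves the package can be found. To check that the pieces actually work together, a tiny Random Forest can be fitted as a smoke test. This is a sketch under assumptions: the toy numbers are invented and have nothing to do with the Google Play Store data, and I picked `RandomForestRegressor` only because a rank is numeric. The function name `smoke_test_random_forest` is hypothetical, and it returns None rather than crashing when scikit-learn is missing.

```python
def smoke_test_random_forest():
    """Fit a tiny Random Forest; return its predictions, or None if scikit-learn is missing."""
    try:
        from sklearn.ensemble import RandomForestRegressor
    except ImportError:
        return None

    # Made-up toy data: two features per row, one numeric target per row.
    X = [[0, 0], [1, 1], [2, 2], [3, 3]]
    y = [0.0, 1.0, 2.0, 3.0]

    # A small forest with a fixed seed keeps the check fast and repeatable.
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X, y)

    # Predict for two of the rows we trained on.
    return model.predict([[1, 1], [2, 2]])


if __name__ == "__main__":
    predictions = smoke_test_random_forest()
    if predictions is None:
        print("scikit-learn is not importable from this interpreter")
    else:
        print("Random Forest fitted; got %d predictions" % len(predictions))
```

If this runs without an error, numpy, scipy, and scikit-learn are all installed in a way that lets them cooperate.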
Things That Go Wrong
1) Error when trying to use easy_install for scikit-learn
When I first tried to install scikit-learn I got the error:
“error: Couldn’t find a setup script in c:…easy_install-bpyisadownload”
This happened because I was just using easy_install and hoping it would find the package over the network. I also tried using pip and got the same result. In order to work around this I downloaded the Windows executable and used easy_install on that file instead.
2) Blas libraries not found
error:
Blas (http://www.netlib.org/blas) libraries not found.
This occurred because I was trying to use easy_install to install scipy whilst being in a virtualenv. I also tried using ‘pip install scipy’ and got basically the same error. In order to get around this I used easy_install directly on the downloaded and unzipped file from
http://sourceforge.net/projects/scipy/files/scipy/
3) ImportError: No module named scipy
Because I’m doing my development in a virtualenv, after I installed scipy it could still not be found. If I deactivated the virtualenv it could be found.
In order to fix this I needed to activate the virtualenv and slog through step 2 above, so that scipy was installed into the virtualenv itself.
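When a package shows up in one place but not the other, it helps to confirm which Python is actually running. The following is a rough sketch using only the standard library. The function name `in_virtualenv` is my own, and the check relies on the attributes that virtualenv and venv are known to set on `sys`, so treat the result as a hint rather than a guarantee.

```python
import sys


def in_virtualenv():
    """Return True if this interpreter appears to be running inside a virtual environment."""
    # Classic virtualenv sets sys.real_prefix; the newer venv module
    # makes sys.base_prefix differ from sys.prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


if __name__ == "__main__":
    print("Interpreter: %s" % sys.executable)
    print("Inside a virtualenv: %s" % in_virtualenv())
```

If the interpreter path points outside the PlayPyNLP directory, the virtualenv is not activated and anything installed from that prompt lands in the system Python instead.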