Installing scikit-learn in a Python Virtualenv to do General Purpose Machine Learning

Kaggle is a platform that hosts competitions in which data scientists try to produce the most accurate predictions for real-world problems.  Many of the competitions use existing data to predict unknown outcomes.  Kaggle has reportedly (source needed) found that Random Forest is used in many of these competitions to predict unseen outcomes.  We’ll use Random Forest to predict the rank at which a set of words will show up in the Google Play Store.

1) Switch to the Virtualenv that was Set Up Previously

I’m going to be running everything in a virtualenv so my packages are well isolated.  This complicates things a bit in the setup, but should make it cleaner for distribution later.

First, open up a DOS Command Prompt.  Next, I’m going to change to the directory that has my virtualenv already set up from Term Frequency-Inverse Document Frequency Extraction of Data Stored in MongoDB.

cd C:\Documents and Settings\Chris\workspace\PlayPyNLP

Next I’ll activate the virtualenv.

Scripts\activate

Now my command line should show

(PlayPyNLP) C:\Documents and Settings\Chris\workspace\PlayPyNLP>
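
The prompt prefix is a good visual cue, but you can also ask Python itself whether it is running from a virtualenv.  Here is a minimal sketch; the function name in_virtualenv is my own invention, and it assumes either the classic virtualenv tool (which sets sys.real_prefix) or the newer venv-style layout (where sys.base_prefix differs from sys.prefix).

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Classic virtualenv sets sys.real_prefix on the interpreter.
    if hasattr(sys, "real_prefix"):
        return True
    # Newer environments instead make sys.base_prefix differ from sys.prefix.
    return getattr(sys, "base_prefix", sys.prefix) != sys.prefix


print("virtualenv active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

If the first line prints False while the (PlayPyNLP) prefix is showing, the python on your PATH is probably not the one inside the virtualenv.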

2) Install Python’s SciPy

SciPy is Python’s main library for math, science, and engineering.  It is also a prerequisite for installing the scikit-learn Python machine learning library.

If you haven’t already done so, you must install numpy since this is a prerequisite for scipy.

(PlayPyNLP) CommandPrompt> pip install numpy

Next, download and install the executables for scipy.  I tried to use pip install for this task initially, but I got an error about a missing BLAS library (see more details in the “Things That Go Wrong” section).

Go to http://sourceforge.net/projects/scipy/files/scipy and download the latest and greatest.  At the time of this writing the latest download link is http://sourceforge.net/projects/scipy/files/latest/download?source=files

Since we are using a virtualenv, it gets a little tricky.  The downloaded executable is just a thin wrapper around 3 possible executables.  Therefore you’ll have to unzip it and copy one of them out to install directly.  Go to the folder containing the downloaded .exe file, in my case scipy-0.11.0-win32-superpack-python2.7.  Now, right-click the file and go

7-Zip -> Extract Files

This will dump 4 files into your specified directory.  The one I wanted is the sse3 file, where the SSE level refers to the CPU instruction set that build is optimized for, so it depends on your computer’s hardware.  Now, go back to your Command Prompt where the virtualenv is activated and navigate to the extracted directory.  Use easy_install to install this file.  Note that I tried like a banshee to get ‘pip install’ to install SciPy, but in the end it was a lot less work to use easy_install.

(PlayPyNLP) CommandPrompt> easy_install scipy-0.11.0-sse3.exe

Hopefully the messaging will report a successful installation.  You can check this out by trying to import the library.

(PlayPyNLP) CommandPrompt> python

(PlayPyNLP) PythonInterpreter> import scipy

If no error is reported this means it’s working.
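
If you would rather not type the imports by hand, the same check can be scripted.  This is only a sketch: installed_version is a hypothetical helper name, and it assumes each package exposes a __version__ attribute (it falls back to "unknown" when one is missing).

```python
import importlib


def installed_version(module_name):
    """Import module_name and return its version string, or None if the import fails."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers both a missing package and a failed DLL load on Windows.
        return None
    return getattr(module, "__version__", "unknown")


for name in ("numpy", "scipy"):
    version = installed_version(name)
    print(name, "->", version if version else "NOT importable")
```

Run it from the Command Prompt with the virtualenv activated so that it tests the same interpreter you just installed into.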

3) Install Python’s Machine Learning Library: scikit-learn

Next, install scikit-learn with pip install.  Note that if you are not using a virtualenv you can just run the Windows installer executable instead.

(PlayPyNLP) CommandPrompt> pip install scikit-learn

Hopefully the messaging will report a successful installation.  You can check this out by trying to import the library.

(PlayPyNLP) CommandPrompt> python

(PlayPyNLP) PythonInterpreter> import sklearn

If no error is reported this means it’s working.
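
A successful import only proves that the package can be found.  Since the plan is to use Random Forest, a slightly stronger smoke test is to fit a tiny forest.  The sketch below uses made-up toy data (nothing to do with the Google Play data) and a function name of my own choosing; it returns None instead of crashing when sklearn is not importable.

```python
def random_forest_smoke_test():
    """Fit a tiny Random Forest; return its predictions, or None if sklearn is missing."""
    try:
        from sklearn.ensemble import RandomForestClassifier
    except ImportError:
        return None
    # Toy data invented for illustration: the label simply equals the first feature.
    X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
    y = [0, 0, 1, 1] * 5
    # A fixed random_state keeps repeated runs comparable.
    model = RandomForestClassifier(n_estimators=10, random_state=0)
    model.fit(X, y)
    return list(model.predict([[0, 1], [1, 0]]))


print(random_forest_smoke_test())
```

If this prints a list of two labels rather than None or a traceback, numpy, scipy, and scikit-learn are all cooperating inside the virtualenv.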

 

Ready for more fun?  How about checking out:

Python in Eclipse with PyDev

Things That Go Wrong

1) Error when trying to use easy_install for scikit-learn

When I first tried to install scikit-learn I got the error:

“error: Couldn’t find a setup script in c:…easy_install-bpyisadownload” 

This happened because I was just running easy_install with the package name and hoping it would find the package over the network.  I also tried using pip and got the same result.  In order to work around this I downloaded the Windows executable and used easy_install on that file instead.

2) Blas libraries not found

error:

Blas (http://www.netlib.org/blas) libraries not found.

This occurred because I was trying to use easy_install to install scipy whilst being in a virtualenv.  I also tried using ‘pip install scipy’ and got basically the same error.  In order to get around this I used easy_install directly on the downloaded and unzipped file from

http://sourceforge.net/projects/scipy/files/scipy/

For more details check out the Stack Overflow post:

http://stackoverflow.com/questions/6114115/windows-virtualenv-pip-numpy-problems-when-installing-numpy

3) ImportError: No module named scipy

Because I’m doing my development in a virtualenv, scipy still could not be found after I installed it.  If I deactivated the virtualenv it could be found, which suggests it had been installed into the system Python rather than into the virtualenv.

In order to fix this I needed to activate the virtualenv and slog through step 2 above.
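
To see which copy of a package (if any) the current interpreter would pick up, you can ask the import machinery where it would load the module from and compare that with the interpreter’s prefix.  This is a rough sketch that assumes Python 3.4 or newer, where importlib.util.find_spec is available; module_location and is_under_prefix are hypothetical helper names.

```python
import importlib.util
import os
import sys


def module_location(module_name):
    """Return the file a top-level module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        return None
    return spec.origin


def is_under_prefix(path, prefix=sys.prefix):
    """Return True when path lives somewhere below prefix (the active environment)."""
    path = os.path.normcase(os.path.abspath(path))
    prefix = os.path.normcase(os.path.abspath(prefix))
    return path.startswith(prefix + os.sep)


location = module_location("scipy")
if location is None:
    print("scipy is not visible to this interpreter")
else:
    print("scipy would load from:", location)
    print("inside active environment:", is_under_prefix(location))
```

Running it once with the virtualenv activated and once after deactivating it shows which of the two Pythons actually received the install.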
