[AISWorld] Responses to "Resources for R and Python"

Jerry Flatto jflatto at uindy.edu
Tue Mar 29 12:42:37 EDT 2016

I recently posted a request for resources related to R and Python.  Thank
you to everyone who responded.  All the responses are provided below.


After some research and thinking, I am planning to go with Python in my
classes.  I am including my thoughts on how I arrived at this decision in
case this might be helpful to others in a similar situation.  Feel free to
email me at jflatto at uindy.edu to kick this around or to tell me why I
should rethink my plan.  While leaning towards Python, I can still be
swayed.  :-)


My business students generally do not know programming and are not generally
going to be statistical experts.  I do not see them pushing the boundaries
of data science but rather working for organizations that want to improve
their decision-making process but will not be "bleeding edge" in most cases.


Rather, I see them spending time capturing data from various sources and
having to clean the data before the analysis.  As such, Python seems to be a
better fit.  I also see more natural language processing in the curriculum,
which Python seems to handle better.  I incorporate Tableau in the
curriculum which helps with visualization. I do not have a philosophical
issue with open source versus commercial software; rather I do not want to
use commercial software so expensive that my students are very unlikely to
have it after they graduate.  Tableau is popular enough so that I
can easily see my students having it available.  Some of my other commercial
software is just "too expensive" for many companies to have. 


As for the option of teaching them R and Python, I am concerned that if I go
this route, the students will not get enough depth in either one to be


Some of the online discussions I have looked at for R versus Python include:











"No trees were harmed in the sending of this message; however, a large
number of electrons were slightly inconvenienced..."

Dr. Jerry Flatto, Professor, Information Systems Department - School of

University of Indianapolis, Indianapolis, Indiana, USA
mailto:jflatto at uindy.edu





This is probably the best resource I have found so far:
https://cran.r-project.org/doc/contrib/Zhao_R_and_data_mining.pdf. And it is
available for free. 




You may find this useful:


> https://www.datacamp.com/community/tutorials/r-or-python-for-data-analysis





RStudio is a good IDE and the server version is free for universities.


Feel free to use my slides.



Chapters 14-18




Hi Jerry, in 2013 a graduate student and I developed a set of five R
tutorials that we submitted to some competition but never heard back about.
Your request reminded me of them, and I just uploaded them to the Teradata
University Network.  Have you been through there, yet, by the way?  It's
teradatauniversitynetwork.com and a lot of faculty upload their teaching
materials to share.

Here's the link to the material on TUN:







I would highly recommend DataCamp (https://www.datacamp.com/home), a site
with several online courses that specialize in R, statistics and analytics
(and to a lesser degree, Python). The format is short videos followed by
hands-on exercises hosted on their cloud R service. I haven't used it for
teaching, but this is what I've been using to learn R myself, and I find the
quality of the content and pedagogy to be excellent (with the sole exception
of their data.table course). The academic price is USD 9 per month, but a
couple of introductory courses are free, and the first chapter of every
course is free, so you can easily try it out.




Your choice to provide instruction in R is wise. I wish I had learned it
during my Ph.D.  I am learning it now.  The learning curve is steep at first
but R is much more powerful and flexible than SPSS.  


There are quite a few free books available in pdf format online. 


R for Beginners

The R Inferno.  http://www.burns-stat.com/pages/Tutor/R_inferno.pdf

Statistics with R is a webpage (http://zoonek2.free.fr/UNIX/48_R/all.html)
but they provide a pdf version of their site.

R tips. http://pj.freefaculty.org/R/Rtips.pdf





There are free books and resources available on specific topics as well. 


An Introduction to Statistical Learning with Applications in R.



There are some really good channels on youtube that provide instruction on








I have watched videos on YouTube. I have taken courses on Udemy. The best
instruction I have taken so far has been on www.datacamp.com.  Learning how
to use R by watching videos is a bit like learning mathematics by watching
someone else do it.  The only real way to learn is by doing it.  It is
important to do a lot of exercises.  DataCamp allows me to do that.
However, it is not free.





This might not be aimed at the right audience for you, Jerry, but there are
some great resources for the beginner here:




With a free online book:







For Python, I recommend "Python for Data Analysis" by McKinney (O'Reilly
Media).  The author is the creator of the 'pandas' library, very useful for
data preparation, and he covers a bit of visualization as well.  If your
students need to start from scratch in the language, I've heard great
reviews of Learn Python the Hard Way (learnpythonthehardway.org).  It's a
free text, but they can pay a small fee for video lessons.


Definitely use Anaconda Scientific Python Distribution from
https://continuum.io.  It's free, and bundles the latest versions of Python
with all the commonly-used packages for data analysis and visualization.
Also, if students have Python already installed for work, Anaconda installs
a separate copy so it doesn't disrupt their current installation.  Best of
all, it has an "install for self" mode which means that students can install
it on computer lab computers without having Administrator access... so I can
bypass going to the university's IT department.





I think considering R/Python instead of SAS/SPSS is a very good idea for
analytics programs. If you are looking for books with R, you may want to
consider the following books:

1. An Introduction to Statistical Learning with Applications in R by Gareth
James et al.

2. R and Data Mining: Examples and Case Studies by Yanchang Zhao

3. Data Mining and Business Analytics with R by Johannes Ledolter



For the decision as to which one to use, that is really dependent on how
much analysis and mathematics will need to be used.


For heavy data analysis and mathematics, here are the recommended open
source options:


1. R 

2. Octave


When you are ready to take an algorithm to a production state or drop it
into another application's workflow, Python with either the pylab or pandas
packages is the way to go.


For machine learning, R and Octave are again the best for working out the
math; Python is the preferred language for implementing it in application
code.


For resources, here are some courses/tutorials that my team has found







Although it is a bit dated at this point, I still really love Stanford's
course on machine learning.  This course does require some pretty heavy
mathematics/stats, so you might want to brush up on those before taking it:




One thing you didn't request is how to visualize the insights or outputs.
For this you can certainly leverage the packages of R or Python to provide
some nice visualization capabilities; however, for more advanced and
explorable options, there is a JavaScript library that has plugins for both
R & Python, D3.js -- you might want to research this as well.



One very useful resource you may consider is PyCharm, the integrated
development environment for Python by JetBrains, which is free for
professors and students. You can find it here:
https://www.jetbrains.com/pycharm/



First, I think it's great that you are moving towards open source, flexible
data analysis tools. This will really help your students think about what
they are doing and let them be more creative. However, with that comes a
price: your students need a modicum of comfort or ability to program or
think like a programmer to use these tools...there are no buttons to just
click on and pretty tables to view data. It's all through programming.

Here are the books that I've found most useful.

Note: Unless noted otherwise, all the resources below have been made freely
available by their authors, but they are also available for purchase from
places like Amazon.com

R Programming Language Resources 

*	Books by Hadley Wickham (a prominent R package developer who has
written a lot of very useful utilities for R) 

*	 <http://r4ds.had.co.nz/> R for Data Science: this is focused on
using R for statistics.
*	 <http://adv-r.had.co.nz/> Advanced R: this is focused on R as a
programming language, not on how to do statistics. 

*	 <https://cran.r-project.org/web/views/> CRAN Task Views: this is a
page maintained by the R Project Team that thematically organizes the myriad
packages in R.  

*	Pros: Well organized and has decent descriptions and links to many
*	Cons: Not exhaustive...more experimental or relatively new packages
are not always there (however, this may not be a bad thing) 

*	 <http://www.cookbook-r.com/> Cookbook for R takes a "just tell me
what to do" approach to many common tasks in R. 

Python Programming Resources
Note: There are currently two versions of Python out there: Python 2 and
Python 3. Normally, the developers try to maintain backwards compatibility,
but they deviated from that principle for Python 3. Much Python 2 code needs
only small changes to run with Python 3, but there are a few gotchas (for
example, print is now a function, and dividing two integers with / returns
a float). I've included a reference that I think does a good job describing
both versions. I'd recommend having your students use Python 3, as it's
where the language is going.
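
To make the Python 2 versus Python 3 gotchas mentioned above a little more
concrete, here is a minimal sketch written for Python 3. The three
differences shown are illustrative picks, not a complete list and not taken
from the reference mentioned in the note.

```python
# Python 3 behavior for three common Python 2 "gotchas".

# 1. print is a function, so the parentheses are required.
#    (The Python 2 statement form `print "hello"` is a SyntaxError here.)
print("hello")

# 2. "/" on two integers performs true division; "//" is floor division.
#    In Python 2, 7 / 2 evaluated to 3.
true_quotient = 7 / 2
floor_quotient = 7 // 2

# 3. Text strings are Unicode by default, and bytes are a separate type.
text = "café"
encoded = text.encode("utf-8")      # a bytes object
decoded = encoded.decode("utf-8")   # back to a str

print(true_quotient, floor_quotient, type(encoded).__name__, decoded)
```

Students who copy older Python 2 snippets from the web will most often trip
over the first two items.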

*	 <https://docs.python.org/3.5/> Official Python 3 Documentation --
Decently written, comprehensive overview of Python's standard library.
*	Core External Packages for Data Analysis: Unlike R, Python's data
science toolkit is comprised of a few "mega packages" as opposed to many
small, focused packages. Also, these packages almost have a life of their
own, with their own conferences and generally well-documented, decent
looking web pages (unlike R's sparse help files). 

*	 <http://scipy.org/> Scipy.org: Not a package, but the SciPy
organization makes most of the packages below.
*	 <http://docs.scipy.org/doc/numpy-1.10.0/user/index.html> Numpy:
Convenient array-like objects that are more user-friendly than Python base
arrays for numerical computations.
*	 <http://docs.scipy.org/doc/scipy/reference/> Scipy: The package
for scientific computing...has tons of stuff from calculus to statistics to
image processing and linear algebra and optimization and.... 

*	 <http://scikit-learn.org/stable/> Scikit-learn: SciPy has a number
of "kits" that add additional functionality. This one has a bunch of cool
machine learning algorithms with generally user-friendly APIs (so they are
more accessible to non-ML experts). Since machine learning is pretty hot
right now, and the idea of AI and computers learning through statistics is
just plain cool, even a brief foray into this area would be well received by
students (e.g., lots of classification algorithms boil down to a linear
model, albeit in a transformed space).

*	 <http://pandas.pydata.org/pandas-docs/stable/index.html> Pandas:
Major contribution is the DataFrame, which is meant to have similar
functionality to R's popular data.frame. Has lots of nice data import/export
features too (e.g., pandas.read_csv("filename.csv") creates a nice
data frame right from a local CSV file).


*	 <http://matplotlib.org/> Matplotlib: Emulates a lot of MATLAB's
plotting functionality, again with a generally user-friendly API.

*	 <http://stanford.edu/~mwaskom/software/seaborn/api.html> Seaborn:
This is a package that uses matplotlib behind the scenes, but it makes a lot
of the choices for you regarding formatting and display...generally good
choices ;-) I use it a lot because I don't like fiddling with tons of

*	(NOT FREE)  <http://www.dabeaz.com/per.html> Python Essential
Reference by David Beazley. This is a very concise (but well written)
reference manual on Python programming (note, does not have a statistics
focus). However, it does a good job pointing out the quirks in the language
and how its internals work, so Python will seem less mysterious.
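
As a rough illustration of the Pandas DataFrame described in the list above,
the sketch below reads a small CSV, tidies two text columns, and totals a
numeric column by group. The data, column names, and cleaning steps are
hypothetical. It assumes pandas is installed (for example through Anaconda)
and reads from an in-memory string rather than a file on disk.

```python
import io

import pandas as pd

# Hypothetical raw data with stray whitespace, mixed capitalization,
# and one missing sales value.
raw_csv = io.StringIO(
    "customer,region,sales\n"
    "Acme , midwest,1200\n"
    "Beta,South,\n"
    "Gamma,MIDWEST,800\n"
)

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(raw_csv)

# Clean the text columns: trim whitespace and normalize capitalization.
df["customer"] = df["customer"].str.strip()
df["region"] = df["region"].str.strip().str.title()

# Drop rows with no sales figure, then total sales by region.
clean = df.dropna(subset=["sales"])
sales_by_region = clean.groupby("region")["sales"].sum()
print(sales_by_region)
```

With a real file, the same steps would start from
pd.read_csv("filename.csv") instead of the in-memory string.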

New(er) Data Formats

It may also be helpful for you to briefly describe how to use JSON and YAML
data formats. They aren't super difficult to learn, but both R and Python
can parse these files into useful data structures and they allow for
expressing more complex data (like nested lists). It also helps if your
students aren't tied to CSV files, useful as they may be for basic

*	 <http://www.w3schools.com/json/> JSON: Less "human readable" but
widely used.
*	 <http://ess.khhq.net/wiki/YAML_Tutorial> YAML: More readable and a
personal favorite of mine for developing configuration files and expressing
complex data.
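
To show what parsing such a file into a useful data structure can look like,
here is a minimal sketch using the json module from Python's standard
library. The record and its field names are hypothetical. YAML works
similarly but needs a third-party package, so it is not shown here.

```python
import json

# Hypothetical record with a nested list of orders, which a single flat
# CSV row cannot express directly.
json_text = """
{
  "customer": "Acme",
  "orders": [
    {"item": "widget", "quantity": 3},
    {"item": "gadget", "quantity": 1}
  ]
}
"""

# json.loads turns JSON text into ordinary Python dicts and lists.
record = json.loads(json_text)
total_quantity = sum(order["quantity"] for order in record["orders"])
print(record["customer"], total_quantity)

# json.dumps goes the other way, from Python objects back to JSON text.
round_trip = json.dumps(record, indent=2)
```

Once parsed, the nested data can be explored with normal indexing, such as
record["orders"][0]["item"].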

Finally: Don't underestimate YouTube....lots of great stuff related to above,
and it's generally easier to digest a 15 minute example.

As a practicing data scientist, I regularly use all the above items, and
they have helped me learn a lot of techniques.

Hope it helps you and your students.



*	If your students are going to work with R, they most definitely
should install  <https://www.rstudio.com/> RStudio, which is a great IDE
(dare I say "industry standard"?) for R.
*	Johns Hopkins offers a
<https://www.coursera.org/specializations/jhu-data-science> data science
specialization on Coursera. The specialization itself has a fee, but the
courses are free, they are based on R, and they're done well. In particular,
the second course is an introduction to R programming.
*	The swirl package can be installed from CRAN. It is a learn-by-doing
approach to R and related topics. Once installed, it lets you choose from a
<https://github.com/swirldev/swirl_courses#swirl-courses> list of courses
and then walks you through entering and executing code. Someone shifting
from, say, Python to R might find it a tad basic, but for a beginner it's a
fairly painless introduction to R coding.
*	There's an active
<https://plus.google.com/communities/117681470673972651781> Statistics and R
Google+ community where people can seek help.



I could recommend some textbooks, but you have enough by now. Besides, it
would be useful to visit some interesting sites showing R applications.
Here is a suggestion:
R Stats + Digital Analytics: 8 Blogs you should Follow



Learning Base R,

by Lawrence M. Leemis,

2016, Lightning Source, ISBN: 978-0-9829174-8-0. 

Available on Amazon.


*Learning Base R* provides an introduction to the R language for those with
limited or no prior programming experience.  It introduces the key topics,
listed below, that are needed to begin analyzing data and programming in R.

The focus is on the R language rather than a particular application.  Nearly
200 exercises make the book appropriate for classroom use. 



You might want to take a look at R for Marketing Research and Analytics
<http://r-marketing.r-forge.r-project.org/>. The first half of the book
focuses on basic statistical operations that ought to be fairly universal
(plotting, crosstabulating, ANOVA and linear regression).  The second half
covers a variety of more specific methods that are useful in marketing
including factor analysis, choice modeling and hierarchical modeling. It
wasn't intended as a textbook, but a few marketing faculty have adopted it.
They are creating slides and exercises to go with the book and should be
posting them in the next week or so.  You can read a review of the book in
the Journal of Statistical Software.










