This month, I decided to try out a training module on data analysis with Python. While, overall, I think I may have a more conservative outlook on what “Big Data” can help us achieve than some of my peers in the School of Information, there’s no doubt that being conversant with data is becoming increasingly important for a diverse range of fields and tasks. By conversant with data, I mean being able to effectively retrieve, handle, manage, read, describe, understand, repair, process, transform, analyze, and store it, among other things. Being data-fluent is fast becoming an expectation for librarians, and will especially be necessary if I go on to work with research data as a career, as I am currently planning. Additionally, and most germane to the goals of the Shapiro Design Lab, handling and working with volunteer-collected data in varied forms is a key task (and some might say obstacle) of citizen science projects, particularly large scale ones. While a two-hour training module is not going to make anyone an expert on working with data, it’s not a bad place to get an introduction.
When not working with truly big data sets, from what I can tell, two main schools of data science seem to be dominant: the people who primarily use R and the people who prefer python with the numpy and pandas packages. Both offer powerful tools for the set of tasks commonly referred to as “data science,” and each has its own relative pros and cons. I’m an open source partisan, of course, but that didn’t help me choose here, since both frameworks are open source. I decided to try learning a bit about numpy and pandas, since I have experimented with learning R elsewhere.
Numpy is a package for python that greatly extends its capability for scientific computing. Basic python is not an effective tool for some computing tasks because of the way its array objects are implemented: python lists are one-dimensional dynamic arrays which, to my knowledge, store references to other objects that can live in different places in memory. Numpy offers n-dimensional arrays that occupy contiguous regions of memory, making them more rigid than python lists but much faster and more powerful for processing collections of data of a single type. There are almost certainly further optimizations in the package, but the module did not go that far into the technical details. This way of handling arrays also allows the use of highly efficient computation libraries written in languages like FORTRAN and C. Along with support for n-dimensional arrays, numpy comes with a set of mathematical operations that apply to whole arrays, rows, or columns. Navigating the array objects involves a generalization of python’s sequence indexing/slicing syntax, which is sensible though somewhat hard to get used to.
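To make that a bit more concrete, here is a tiny sketch of my own (not code from the module) showing whole-array operations, per-axis operations, and the generalized slicing syntax:

```python
import numpy as np

# A small 2-D array holding data of a single type in one contiguous block of memory.
data = np.array([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])

# Mathematical operations apply to the whole array at once...
doubled = data * 2

# ...or along one axis at a time (here, the mean of each column).
column_means = data.mean(axis=0)

# Slicing generalizes python's sequence syntax to n dimensions:
# every row, but only the first two columns.
subset = data[:, :2]

print(doubled)
print(column_means)
print(subset)
```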
The example problem in the module looked at temperature data from NOAA. Getting the data from the file into an array was a challenge, and it was at this point that the presenter went a bit too fast for me to follow along. This seems to be the most difficult part of the process, at least if the file is not formatted in a way that makes the import easy. Once the data was in an array, we could do useful operations like interpolating missing values (important!) and smoothing (e.g. making each point the average of the previous 10 points and the next 10 points) to help with noisy data.
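As a rough sketch of what those two steps can look like in numpy (this is my own toy example with a made-up temperature series, not the NOAA file or the presenter’s code):

```python
import numpy as np

# A made-up stand-in for the NOAA file: a year of noisy daily temperatures
# with a handful of missing readings marked as NaN.
rng = np.random.default_rng(0)
days = np.arange(365)
temps = 15 + 10 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 2, days.size)
temps[rng.choice(days.size, 20, replace=False)] = np.nan

# Interpolate the missing points linearly from their known neighbors.
missing = np.isnan(temps)
temps[missing] = np.interp(days[missing], days[~missing], temps[~missing])

# Smooth by averaging each point with its previous 10 and next 10 neighbors
# (a 21-point centered moving average).
kernel = np.ones(21) / 21
smoothed = np.convolve(temps, kernel, mode="same")
```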
The other topic the module covered was a data science package for python called pandas. Pandas builds on numpy to make working with data in arrays more flexible and easier to follow. It does this by implementing what are called “data frames,” which offer a few new features. The most immediately obvious is that rows and columns can have, and be displayed with, labels, which makes the tabular representation much friendlier. Another is that data frames are size-mutable (or at least appear to be on the user’s end) in a way that basic numpy arrays are not. Third, and probably most importantly, data frames support a functionality called “stacking,” which lets data be aggregated and manipulated in different ways depending on how the frames are “stacked.” These operations were unintuitive (to me, at least) but powerful. Pandas data frames are something I definitely want to learn more thoroughly than 45 minutes of an online training module allows, though. One particularly helpful technique I took away from this part was what the presenter called “boolean masks”: you check a condition on an entire array and get back an array of Trues and Falses, then use that array to select only the entries of the original array that satisfied the condition. This seems like a very powerful tool for analyzing data sets.
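Here is a small sketch of those ideas; the station names and temperatures are made up for illustration and this is not the module’s example:

```python
import pandas as pd

# Made-up monthly temperatures for two stations.
df = pd.DataFrame(
    {"Ann Arbor": [-2.1, 0.4, 6.3, 12.8],
     "Detroit":   [-1.5, 1.0, 7.1, 13.4]},
    index=["Jan", "Feb", "Mar", "Apr"],
)

# Rows and columns carry labels, so the tabular view is easy to read.
print(df)

# "Stacking" pivots the column labels into the row index, giving a long-format
# Series that can then be grouped and aggregated differently.
stacked = df.stack()
print(stacked)

# A boolean mask: test a condition across the whole frame, then use the
# resulting grid of Trues and Falses to pick out the matching entries.
above_freezing = df > 0
print(df[above_freezing])
```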
Overall, I thought this was an interesting and pretty informative module, given that data analysis is far too large a topic to do anything but touch on introductory points in a 2.25-hour time span. I don’t feel like I’m ready to go out there and start tackling real data sets yet, but I am interested in looking into pandas more deeply. It’s also worth noting that pandas is a very new package: the most recent official release is version 0.23.4 as of August. If it’s already this popular and powerful this early on, it’s definitely worth keeping up with.