Introduction to Data Analysis and the Numpy Package

We will begin by giving a high-level overview of the kinds of data analysis tasks we will be performing and why they are important. We’ll also introduce the numpy Python package and work through some examples in a Jupyter notebook.

By the end of this module, students should be able to:

  • Understand what data analysis is and why (at a high level) it is important.

  • Have a basic understanding of the different kinds of tasks we will perform and what libraries we will use for each kind of task.

  • (Numpy) Understand the primary differences between the ndarray object from numpy and basic Python lists, and when to use each.

  • (Numpy) Utilize ndarray objects to perform various computations, including linear algebra calculations and statistical operations.

Data Analysis and Manipulation

In this class, as with problems in the real world, we will be working with datasets. When we introduce Machine Learning in a couple of weeks, you will see that our algorithms and workflows require data – high-quality data – in order to be effective.

Datasets come in different shapes and sizes:

  1. Text data, in different formats and types; for example, JSON, CSV, XML, SQL, etc.

  2. Image data, in different formats and sizes; for example, JPG, PNG, BMP, TIFF, etc.

  3. Audio data, in different formats and sizes; for example, M4A, MP3, MP4, WAV, FLAC, etc.

  4. Video data, in different formats and sizes; for example, MVV, MOV, MPEG4, etc.

Additionally, sometimes different types of data are combined – for example, a set of images or audio files together with some labels in text format.

While data naturally appear in all of the formats above, most of the algorithms we will be studying work on numerical data; i.e., integer and float types. We will need to convert the data to numerical form before these algorithms can use it effectively.
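As a small illustration of turning non-numerical data into numbers, the sketch below maps text labels to integer codes. The labels are made up for this example, and np.unique is only one of several ways to do this; it is not necessarily the approach we will use later in the course.

```python
import numpy as np

# Hypothetical text labels, e.g. one column read from a CSV file.
labels = np.array(["cat", "dog", "cat", "bird", "dog"])

# np.unique returns the sorted distinct values. With return_inverse=True it
# also returns, for each original entry, the index of its value in that
# sorted list, which gives us one integer code per label.
categories, codes = np.unique(labels, return_inverse=True)

print(categories)  # ['bird' 'cat' 'dog']
print(codes)       # [1 2 1 0 2]

# The original labels can be recovered from the codes.
print(categories[codes])
```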

Moreover, it is typical for there to be other issues with the data. Some examples include:

  1. (Text Data) Null or missing values: with tabular text data, some rows may not contain values for every column.

  2. (Text Data) Duplicate data: rows or columns could be duplicated.

  3. (Text Data) Irrelevant data: some columns could be irrelevant to the process being modeled.

  4. (Image and Video Data) The samples could be of different dimensions or different resolutions.

  5. (Audio, Image and Video Data) Samples could be processed with different codecs (compression and/or decompression) and there could be noise and/or distortion on the samples.

A major part of the work in applying machine learning to real-world applications is data preprocessing; that is, analyzing the raw data to understand what you have, and transforming it into a format that can be efficiently utilized by the learning algorithms.

Without taking the necessary steps to prepare the data, the machine learning algorithms will not be effective.
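To make two of the issues above concrete, here is a minimal sketch that finds rows with missing values and removes duplicate rows. The small table is invented for the example, and it uses only numpy; in practice we will mostly rely on pandas for this kind of preprocessing.

```python
import numpy as np

# Hypothetical table: 4 rows, 3 numeric columns; np.nan marks a missing value.
data = np.array([
    [1.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [1.0, 2.0, 3.0],   # exact duplicate of the first row
    [7.0, 8.0, 9.0],
])

# Flag every row that contains at least one missing value.
has_missing = np.isnan(data).any(axis=1)
print(has_missing)  # [False  True False False]

# Keep only the complete rows.
complete = data[~has_missing]

# Drop duplicate rows. Note that np.unique with axis=0 also sorts the rows.
deduplicated = np.unique(complete, axis=0)
print(deduplicated)
```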

Note

“Garbage in, garbage out.” -An astute COE 379L student

Big Picture: Data Analysis Tasks and the Libraries we will use

Data analysis consists of several different kinds of tasks. We will be using Python libraries that specialize in each task type.

  • Numpy and Scipy – These libraries form the foundation of all of the libraries we will use. The numpy package, which we will look at first, provides an efficient multi-dimensional array object, useful for performing numeric computations across vectors and matrices. The scipy package, which depends on the ndarray from numpy, adds a number of scientific algorithms, such as algorithms from linear algebra (e.g., matrix calculations), numerical methods (e.g., integration), FFTs, etc.

  • Pandas – The pandas library provides efficient Series and DataFrame classes for working with column/tabular data. We will make extensive use of Pandas throughout the course for data preprocessing tasks.

  • Matplotlib/Seaborn – These are two visualization libraries, for creating plots of various kinds. We will use these libraries extensively as well, to be able to visualize the datasets we are working with.

  • Scikit-Learn – This is the library we will use during the first (roughly) half of the semester, for the “classical” machine learning algorithms we will employ.

  • Tensorflow – This is the library we will use for neural networks and deep learning.

The Python libraries we will use and their relationships.

Numpy

In this module, we will introduce the Python library numpy for working with arrays of numerical data.

In some ways, numpy is perhaps the package we will use the least directly, but since all the other libraries depend on its ndarray object, it will be useful to have a basic exposure to it. We will, on occasion, use numpy functions directly on our data.

The Numpy Package

The numpy package provides a Python library for working with numerical arrays that can be orders of magnitude faster than ordinary Python lists for numerical calculations. The primary data structure provided by numpy is the ndarray. There are a few main reasons why working with ndarrays is faster than working with normal Python lists:

  1. Storage in memory: Numpy ndarrays are stored in a contiguous block of memory, unlike Python lists, which hold references to objects that may be scattered across the heap. Various algorithms can exploit this contiguity to achieve significant performance gains.

  2. The performance-critical blocks of numpy are written in C/C++ and are compiled with optimizations for different CPU architectures.
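The following sketch gives a rough feel for both points. It checks that an array occupies one contiguous block of memory, then times the same computation (a sum of squares) on a Python list and on an ndarray. The array length and the choice of computation are arbitrary, and the timings will vary from machine to machine, so treat the numbers as indicative only.

```python
import timeit

import numpy as np

n = 200_000  # arbitrary length chosen for this illustration
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# The array's elements sit side by side in one block of memory.
print(np_array.flags["C_CONTIGUOUS"])  # True
print(np_array.itemsize)               # 8 (bytes per int64 element)
print(np_array.nbytes == 8 * n)        # True


def list_sum_of_squares():
    # Pure Python: the interpreter handles one element at a time.
    return sum(x * x for x in py_list)


def numpy_sum_of_squares():
    # numpy: the loop runs inside compiled code.
    return int(np.dot(np_array, np_array))


# Both approaches compute the same value.
print(list_sum_of_squares() == numpy_sum_of_squares())  # True

# Time three runs of each; the exact numbers depend on your machine.
print("list :", timeit.timeit(list_sum_of_squares, number=3))
print("numpy:", timeit.timeit(numpy_sum_of_squares, number=3))
```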

Installing Numpy

The numpy package is available from the Python Package Index (PyPI) and can be installed on most platforms using a Python package manager such as pip:

[container/virtualenv]$ pip install numpy

Warning

I highly recommend you avoid installing these packages directly into the global package namespace (i.e., executing pip install numpy directly on the VM).

Over time, you are likely to run into dependency issues and will have a hard time modifying and/or reproducing your environment in another location.

Once installed, we can import the numpy package; it is customary to import the top level package as np, i.e.,

>>> import numpy as np

Using the Class Docker Container

We have created a Docker image for the class, available on the public Docker Hub (hub.docker.com).

Note

The class image is jstubbs/coe379l. Use either the default (latest) tag or the :fa25 tag.

The docker image contains all of the libraries that we will need for the course, including numpy and jupyter.

You can see a list of all of the installed packages in the poetry.lock file on the class repo. (By the way, if you don’t know about Python Poetry, check it out!)

Numpy Arrays

The workhorse of numpy is the ndarray class. Arrays are collections of data of the same type.
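The "same type" requirement is one of the main differences from a Python list, which can hold values of different types. The sketch below uses the np.array() function (introduced in the next section) to show what numpy does when it is given mixed input: it picks a single type that can represent every element. The variable names are ours, chosen for illustration.

```python
import numpy as np

# A Python list can hold values of different types side by side.
py_list = [1, 2.5, "three"]
print([type(x).__name__ for x in py_list])  # ['int', 'float', 'str']

# An ndarray has exactly one element type, reported by its dtype attribute.
ints = np.array([1, 2, 3])
print(ints.dtype)  # an integer type such as int64 (platform dependent)

# Mixing ints and floats: every element is stored as a float.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)  # float64
print(mixed)        # [1.  2.5 3. ]

# Mixing numbers and text: every element is stored as a string.
strings = np.array([1, "two", 3])
print(strings.dtype.kind)  # U (a Unicode string type)
print(strings.tolist())    # ['1', 'two', '3']
```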

Creating Arrays from lists

We can create an array in numpy from a list of integers using the np.array() function, as follows:

>>> m = np.array([1,2,3,4,5])

Numpy arrays have both a size and a shape:

>>> m.size
5

>>> m.shape
(5,)

The size attribute returns the total number of elements in the array, while the shape attribute returns the length of each dimension of the array. The array we defined above was a 1-dimensional array (or a “1-d array”). Numpy supports creating arrays of different dimensions. For example, we can create a 2-d or a 3-d array by passing additional lists to the np.array() function:

# 2-d array
>>> m2 = np.array([[1,2,3,4,5], [6,7,8,9,10]])
>>> m2.size
10
>>> m2.shape
(2, 5)

# 3-d array
>>> m3 = np.array([ [[1, 2], [3, 4], [5, 6]], [[-1, -2], [-3, -4], [-5, -6]]] )
>>> m3.size
12
>>> m3.shape
(2, 3, 2)

The shape of m2 is (2, 5), indicating that it has 2 rows of 5 elements each. Similarly, the shape of m3 is (2, 3, 2) because its three dimensions have lengths 2, 3 and 2; loosely speaking, it has 2 rows, 3 columns and a “depth” of 2.

Another way to think of it is this: a 2d-array is an array that has 1d-arrays as its elements. Similarly, a 3d-array is an array with 2d-arrays as its elements, etc.
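A short sketch of this nesting, reusing the m3 array defined above: indexing once peels off one dimension. It also checks that size is the product of the entries of shape. The names first and row are ours, chosen for illustration.

```python
import math

import numpy as np

m3 = np.array([[[1, 2], [3, 4], [5, 6]],
               [[-1, -2], [-3, -4], [-5, -6]]])

# size is the product of the entries of shape: 2 * 3 * 2 = 12.
print(m3.size == math.prod(m3.shape))  # True

# Indexing the 3-d array once gives a 2-d array...
first = m3[0]
print(first.shape)  # (3, 2)

# ...and indexing that 2-d array once gives a 1-d array.
row = first[1]
print(row)        # [3 4]
print(row.shape)  # (2,)
```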

Warning

Take care to note the use of opening ([) and closing (]) brackets. Ultimately, the np.array() function takes one positional argument, which is the list (array) of objects (elements, 1d-arrays, 2d-arrays, etc.)

If we get confused, we can always ask numpy for the number of dimensions of an array:

>>> m3.ndim
3

Note that each row of an ndarray must have the same number of elements; the following does not work:

>>> m = np.array([[1,2,3], [6,7]])

What happens if you try the code above?
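If you would like to explore the question in a script rather than at the prompt, the sketch below is one way to do it. It is written to cope with either behavior, because the outcome depends on the numpy version: to our knowledge, releases from 1.24 onward raise a ValueError, while older releases issue a deprecation warning and build an array of Python list objects. The names ragged and outcome are ours.

```python
import numpy as np

ragged = [[1, 2, 3], [6, 7]]  # rows of different lengths

try:
    m = np.array(ragged)
except ValueError as err:
    # Recent numpy releases refuse to build the array.
    outcome = "ValueError"
    print("numpy raised ValueError:", err)
else:
    # Older releases fall back to a 1-d array whose elements are Python lists.
    outcome = str(m.dtype)
    print("numpy built an array with dtype", m.dtype, "and shape", m.shape)
```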

Exercise. Using the np.array function, create a 3d-array with 2 rows, 3 columns and 4 depth dimensions. Put a 1 in the first entry, and increase each subsequent entry by 1 so that the values in the array are 1, 2, 3, 4, 5, … What is the largest value in the array?

Similarly, we can create an array of random numbers using the random module that ships with numpy. It is available as np.random once numpy has been imported, so no additional import is needed. Here we create an array of random integers over a specific range:

# create a 3x4 array of random integers from 0 to 99 (the upper bound, 100, is excluded)
>>> m = np.random.randint(100, size=(3, 4))
>>> m
array([[22, 33, 35, 66],
       [41, 84, 25, 89],
       [23, 99, 94,  3]])

Note that the value of the size parameter is a tuple with the sizes of each dimension.
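Because the values are random, your output will differ from the one shown above. The sketch below passes the lower and upper bounds explicitly and fixes a seed so that repeated runs produce the same numbers. The seed value 0 is arbitrary, and np.random.seed is just one way to get reproducible results.

```python
import numpy as np

# Fix the seed so that the "random" numbers are the same on every run.
np.random.seed(0)
a = np.random.randint(0, 100, size=(3, 4))  # low=0 (included), high=100 (excluded)

# Resetting the same seed reproduces exactly the same array.
np.random.seed(0)
b = np.random.randint(0, 100, size=(3, 4))

print(a.shape)                      # (3, 4)
print((a == b).all())               # True
print(a.min() >= 0, a.max() < 100)  # True True
```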

References and Additional Resources

  1. Numpy documentation – Numpy v1.26 manual.