What is data wrangling? Mention three points to consider in the process.
Data wrangling is a process by which we convert and map data. This changes data from its raw form to a format that is a lot more valuable.
Data wrangling is the first step for machine learning and deep learning. The end goal is to provide data that is actionable and to provide it as fast as possible.
There are three major things to focus on while talking about data wrangling –
1. Acquiring data
The first and probably the most important step in data science is acquiring, sorting and cleaning data. This is an extremely tedious process and usually takes the most time.
One needs to:
- Check if the data is valid and up-to-date.
- Check if the data acquired is relevant for the problem at hand.
2. Data cleaning
Data cleaning is an essential component of data wrangling and requires a lot of patience. To make the job easier, it is essential to first format the data so that it is readable for humans.
The essentials involved are:
- Format the data to make it more readable
- Find outliers (data points that do not match the rest of the dataset) in data
- Find missing values and remove them from the data set (otherwise, any model trained on the data may be incomplete or unreliable)
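The cleaning steps above can be sketched with nothing but the standard library. The readings below are made up for illustration, and the interquartile-range (IQR) rule used to flag outliers is only one common choice, not something the steps above prescribe.

```python
import statistics

# Hypothetical raw readings; None marks a missing value.
raw = [12.1, 11.8, None, 12.4, 95.0, 12.0, None, 11.9, 12.3]

# Step 1: find the missing values and remove them.
missing_idx = [i for i, v in enumerate(raw) if v is None]
clean = [v for v in raw if v is not None]

# Step 2: flag outliers with the IQR rule (an assumed, commonly used heuristic):
# anything more than 1.5 * IQR outside the middle 50% of the data is suspicious.
q1, _, q3 = statistics.quantiles(clean, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in clean if v < low or v > high]

print("missing at positions:", missing_idx)
print("outliers:", outliers)
```

Whether a flagged outlier should be dropped, capped or kept depends on the problem at hand, so treat the output as a list of points to inspect rather than to delete blindly.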
3. Data Computation
At times, your machine may not have enough resources to run your algorithm, e.g. you might not have a GPU. In these cases, you can use publicly available APIs to run your algorithm. These are standard endpoints on the web which allow you to use remote computing power and process data without having to rely on your own system. An example would be the Google Colab platform.
Why is normalisation required before applying any machine learning model? What module can you use to perform normalisation?
Normalisation is a process that is required when an algorithm uses something like distance measures. Examples would be clustering data, finding cosine similarities, creating recommender systems.
Normalisation is not always required. It is done to prevent variables that are on a larger scale from dominating variables that are on a smaller scale. For example, consider a dataset of employees’ incomes. Income is on a very different scale from most other variables, so if you try to cluster the data as it is, income would dominate the distance calculation. Hence, we would have to normalise the data to prevent incorrect clustering.
A key point to note is that normalisation does not distort the differences in the range of values.
A problem we might face if we don’t normalise data is that gradient descent would take a very long time to converge to a minimum.
For numerical data, normalisation is generally done between the range of 0 to 1.
The general formula is:
Xnew = (x-xmin)/(xmax-xmin)
Performing Normalisation in Python
In python, this can be easily done with the scikit-learn module (this can be installed through a pip command or installed from anaconda).
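A minimal sketch of min-max normalisation with scikit-learn is given here, assuming the MinMaxScaler class from sklearn.preprocessing is the tool meant. The employee table (age and income columns) is hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical employee data: one row per employee, columns = [age, income].
X = np.array([
    [25, 30_000],
    [40, 85_000],
    [58, 120_000],
], dtype=float)

# Rescale every column to the range 0..1 using (x - xmin) / (xmax - xmin).
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)

print(X_scaled)
```

When there is a separate test set, the usual practice is to call fit_transform on the training data only and then transform on the test data, so that both are scaled with the same minimum and maximum.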
What is a sobel filter? How would you implement it in Python?
The sobel filter performs a two-dimensional spatial gradient measurement on a given image which then emphasizes regions which have high spatial frequency. In effect, this means finding edges.
In most cases, sobel filters are used to find the approximate absolute gradient magnitude for every point in a grayscale image. The operator consists of a pair of 3×3 convolution kernels. One kernel is simply the other rotated by 90 degrees.
These kernels respond to edges that run horizontally or vertically with respect to the pixel grid, one kernel for each orientation. A point to note is that the kernels can be applied separately, or their outputs can be combined to find the absolute magnitude of the gradient at every point.
The sobel operator has a larger convolution kernel than simpler operators such as the Roberts cross. This smooths the image to a greater extent and thus makes the operator less sensitive to noise. It also produces higher output values for similar edges compared to such methods.
The output values from the operator can easily overflow the maximum allowed pixel value of image types that only support small integer pixel values (for example, 8-bit images). To overcome this problem, use an image type that supports a larger range of pixel values.
Implementation in Python
To implement it in Python, we can use the OpenCV module (can be installed from pip):
What is the curse of dimensionality?
The curse of dimensionality states that if the number of features is very large relative to the number of observations in a certain data set, many algorithms will fail to be able to train an effective model. This is extremely relevant to many of the commonly used algorithms, especially those that rely on distance measures.
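One way to see why distance-based algorithms suffer is to measure how much the nearest and farthest neighbours of a point differ as the number of features grows. The sketch below uses uniformly random points, and distance_contrast is a hypothetical helper written for this illustration.

```python
import math
import random

def distance_contrast(n_dims, n_points=200, seed=0):
    """Return (max - min) / min over distances from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

# The contrast shrinks as the dimensionality grows: all points start to look
# roughly equally far away, so "nearest" carries less and less information.
for d in (2, 10, 100, 1000):
    print(d, round(distance_contrast(d), 3))
```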
What is the difference between feature selection and feature extraction?
Feature selection and feature extraction are two major ways of dealing with the curse of dimensionality.
1. Feature selection:
Feature selection is used to filter a subset of input variables on which the attention should focus. Every other variable is ignored. This is something which we, as humans, tend to do subconsciously.
Many domains have tens of thousands of variables, most of which are irrelevant or redundant. Feature selection reduces the number of input variables in the training data and the amount of computational resources used. It can significantly improve a learning algorithm's performance.
In summary, we can say that the goal of feature selection is to find an optimal feature subset. The subset found might not be truly optimal; however, methods for estimating the importance of features also exist. Some modules in Python, such as xgboost, report feature importances that help with this.
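As a rough illustration of picking a feature subset, the sketch below ranks features by the absolute value of their Pearson correlation with the target and keeps the top k. This simple filter method is swapped in for illustration only (it is not what xgboost does), and the feature names and values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical dataset: three candidate features and a numeric target (exam score).
features = {
    "hours_studied":  [1, 2, 3, 4, 5, 6],
    "shoe_size":      [7, 9, 8, 7, 9, 8],
    "classes_missed": [6, 5, 3, 4, 1, 2],
}
target = [52, 58, 65, 70, 79, 85]

# Score each feature by |correlation with the target| and keep the k best.
scores = {name: abs(pearson(col, target)) for name, col in features.items()}
k = 2
selected = sorted(scores, key=scores.get, reverse=True)[:k]
print(selected)
```

A correlation filter only captures linear, one-feature-at-a-time relationships, which is one reason model-based importance scores are often preferred in practice.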
2. Feature extraction
Feature extraction involves transforming the raw data into new features, which in turn improves the process of feature selection. For example, in an unsupervised learning problem, extracting bigrams from a text or extracting contours from an image are both forms of feature extraction.
The general workflow involves applying feature extraction on the given data to extract features, and then applying feature selection with respect to the target variable to select a subset of those features. In effect, this helps improve the accuracy of a model.
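To make the bigram example concrete, here is a minimal sketch that turns a sentence into counted word pairs. Splitting on whitespace is a simplifying assumption; a real tokenizer would also handle punctuation.

```python
from collections import Counter

def extract_bigrams(text):
    """Split text into lowercase word tokens and return adjacent word pairs."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

doc = "the quick brown fox jumps over the quick dog"

# Each distinct bigram becomes a feature; its count is the feature value.
bigram_counts = Counter(extract_bigrams(doc))
print(bigram_counts.most_common(1))
```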
What is Support Vector Regression?
Support Vector Machines (SVMs) can also be used for regression. Support Vector Regression keeps the main features of the SVM algorithm and modifies it slightly to predict real values. Since the output is a real number, it has infinitely many possible values, which makes it difficult to predict exactly. Hence, in the case of regression, a margin of tolerance (epsilon) is set around the prediction, as an approximation to the SVM.
The main idea behind Support Vector Regression stays the same as for SVMs, i.e. to minimise error by finding the hyperplane that maximises the margin. The difference is that part of the error is tolerated, so the form of error tolerance also plays a key role here.
In simple regression, we try to minimise the error rate. On the other hand, in support vector regression, we try to fit the error within a certain threshold.
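The difference between the two objectives can be sketched by comparing their loss functions. The threshold is called epsilon here, and the value 0.5 is an arbitrary choice for illustration.

```python
def squared_error(y_true, y_pred):
    """Loss used in ordinary least-squares regression: every deviation is penalised."""
    return (y_true - y_pred) ** 2

def epsilon_insensitive(y_true, y_pred, epsilon=0.5):
    """Loss used in support vector regression: errors within +/- epsilon cost nothing."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

# Compare both losses for a small, a borderline and a large prediction error.
for y_true, y_pred in [(3.0, 3.2), (3.0, 3.5), (3.0, 4.5)]:
    print(y_true, y_pred,
          squared_error(y_true, y_pred),
          epsilon_insensitive(y_true, y_pred))
```

Predictions that fall inside the tolerance band contribute no loss at all, which is why only the points on or outside the band (the support vectors) end up shaping the fitted function.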
What is sklearn? Why and when would you need to use it?
sklearn or scikit-learn is a python library which allows using a huge range of supervised and unsupervised learning algorithms.
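sklearn builds on the scientific Python stack. NumPy and SciPy are required to install it, and the other libraries below are commonly used alongside it: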
- Scipy which is a library for scientific computing in python
- Matplotlib which allows plotting of 2d or 3d figures
- Numpy which allows for n dimensional array calculations
- Ipython which is an interactive console for python
- Pandas which is used for data analysis and working with data
- Sympy which is used for symbolic mathematics
Most of the features of scikit-learn fall under the following categories:
- Supervised Learning – All of the common types of supervised learning models can be implemented in scikit-learn. Examples include svms, decision trees etc.
- Feature extraction methods which help to define attributes in text and images.
- Clustering techniques such as k-means.
- Dimensionality reduction techniques which help to reduce attributes from high dimensional spaces.
- Feature selection which helps in choosing attributes and in turn, reduces computation.
- Manifold learning which helps to summarise and depict complex multidimensional data.
- Scikit-learn also provides a module for ensemble learning methods, which combine the predictions of multiple supervised models.
- Parameter tuning is also provided, which helps to get the best performance out of an algorithm.
How would you perform thresholding and canny detection using opencv?
Thresholding is one of the simplest methods which can be used for image segmentation. Its input is a grayscale image which can then be used to create binary images.
The most common use case of thresholding would be to separate a region of interest, such as a specific colour, from the rest of an image. Mathematically, thresholding replaces each pixel value in an image with black if it is less than a particular constant (the threshold chosen for the required colour), or with white if it is greater than the constant.
Edge detection is an important principle in image processing. Canny edge detection is one of the most popular algorithms for edge detection.
The algorithm works as follows:
- Reduce noise in the image
- Find the intensity gradient
- Perform non-maximum suppression
- Perform hysteresis thresholding.
When would you use ARIMA?
ARIMA is a widely used statistical method which stands for Auto Regressive Integrated Moving Average. It is generally used for analysing time series data and time series forecasting. Let’s take a quick look at the terms involved.
Auto Regression is a model that uses the relationship between an observation and a number of lagged observations.
Integrated means use of differences in raw observations which help make the time series stationary.
Moving Average is a model that uses the dependency between an observation and the residual errors from a moving average model applied to lagged observations.
Note that each of these components is specified as a parameter of the model. Based on these parameters, a linear regression model is constructed.
Data is prepared by:
- Finding out the differences
- Removing trends and structures that will negatively affect the model
- Finally, making the time series stationary.
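The differencing step can be sketched in a few lines. The series below is hypothetical (an upward trend with a small repeating wiggle), and difference is a helper written for this illustration.

```python
def difference(series, lag=1):
    """Return the differenced series: series[t] - series[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Hypothetical observations with a clear upward trend.
series = [10, 13, 14, 17, 18, 21, 22, 25, 26, 29]

# After first-order differencing the values no longer trend upwards;
# they fluctuate around a constant level instead.
first_diff = difference(series)
print(first_diff)
```

In an ARIMA model the number of times this differencing is applied is the "Integrated" parameter, usually written d.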
Why are polarity and subjectivity an issue?
Polarity and subjectivity are terms which are generally used in sentiment analysis.
Polarity is the variation of emotions in a sentence. Since sentiment analysis is widely dependent on emotions and their intensity, polarity turns out to be an extremely important factor.
In most cases, opinions and sentiment analysis are evaluations. They fall under the categories of emotional and rational evaluations.
Rational evaluations, as the name suggests, are based on facts and rationality while emotional evaluations are based on non-tangible responses, which are not always easy to detect.
Subjectivity in sentiment analysis is a matter of personal feelings and beliefs, which may or may not be based on any fact. When there is a lot of subjectivity in a text, it must be explained and analysed in context. In contrast, if there is a lot of polarity in the text, it can be expressed as a positive, negative or neutral emotion.
Where is the confusion matrix used? Which module would you use to show it?
In machine learning, a confusion matrix is one of the easiest ways to summarise the performance of your algorithm.
At times, it is difficult to judge how good a model is by looking at accuracy alone, because of problems like an unequal class distribution. So, a better way to check how good your model is, is to use a confusion matrix.
First, let’s look at some key terms.
Classification accuracy – The ratio of the number of correct predictions to the total number of predictions made
True positives – Positive events that were correctly predicted as positive
False positives – Negative events that were incorrectly predicted as positive
True negatives – Negative events that were correctly predicted as negative
False negatives – Positive events that were incorrectly predicted as negative.
The confusion matrix is now simply a matrix containing true positives, false positives, true negatives, false negatives.
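For a binary problem, the four counts can be computed by hand. The labels below are hypothetical, and the layout (rows for actual classes, columns for predicted classes) is one common convention.

```python
# Hypothetical binary labels: 1 = positive event, 0 = negative event.
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives

# Rows are actual classes (0, 1); columns are predicted classes (0, 1).
matrix = [[tn, fp],
          [fn, tp]]
accuracy = (tp + tn) / len(pairs)
print(matrix, accuracy)
```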
Let’s take the Iris Flower dataset as an example.
The confusion matrix using a Linear Support Vector Classifier with the parameter C set to 0.01 is as follows:
Confusion matrix, without normalization
[[13 0 0]
[ 0 10 6]
[ 0 0 9]]
Normalized confusion matrix
[[1.0 0.0 0.0 ]
[0.0 0.62 0.38]
[0.0 0.0 1.0]]
Implementing it in python is extremely easy:
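A minimal sketch with scikit-learn's confusion_matrix function is given below. It assumes the classifier is SVC with a linear kernel and that the data is split with train_test_split using random_state=0. Both are assumptions, so the counts printed may differ from the ones shown above.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
# random_state=0 is an assumed choice; a different split gives different counts.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

# Linear support vector classifier with C=0.01 (assumed to be SVC with a linear kernel).
classifier = svm.SVC(kernel="linear", C=0.01).fit(X_train, y_train)
y_pred = classifier.predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
# Normalise each row so that it sums to 1 (fraction of each true class).
cm_normalized = cm / cm.sum(axis=1, keepdims=True)

print("Confusion matrix, without normalization")
print(cm)
print("Normalized confusion matrix")
print(np.round(cm_normalized, 2))
```

The diagonal holds the correctly classified samples of each class, so any large off-diagonal entry points directly at the pair of classes the model confuses.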