Article by Yutika Rege (MLE-2 @Pyrack) 9 min read

The Craft of Extracting Mathematical Insights from Allied Columns

Mathematical equations rule the world

Mathematics is a quintessential part of our lives, woven deeply into the root of our existence. From the rhythmic patterns of nature to the precision of technological innovations, the language of mathematics speaks universally, shaping the way we understand and navigate the world around us.

Beyond its role in academia, mathematics is a guiding compass in decision-making. It goes without saying that mathematics helps unveil patterns that the human eye can easily overlook.

This week, we shall discuss how the synergy of mathematics and machine learning can help unearth correlations between multiple columns in a dataset and subsequently produce mathematical equations based on these correlations.

How does one acquire these equations programmatically?

In the realm of data science, dealing with datasets that harbour thousands of columns and millions of rows is a common scenario. Skimming through such colossal volumes of data can be a real challenge, let alone extracting meaningful insights pertaining to the correlated columns that could potentially yield meaningful mathematical equations.

So then, how do we tackle this challenge? The answer lies in the power of Linear Algebra backed by Python's very own Scikit-Learn library!

Pyrack’s approach towards deriving equations and how you can try it too

Step-1: Data preparation

For this exercise we shall be using a dummy dataset with the following columns:

Fig-1: Snippet of the dataset

The source of this dataset shall remain undisclosed; it was acquired in the form of a CSV file. This CSV file was then read into the Python environment as a pandas dataframe, for ease of exploratory and statistical analyses. The dataset shown above is a subset of the dataframe, and the dimensions of this subset are 1,000,000 rows and 4 columns.
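Since the source CSV is undisclosed, a synthetic stand-in can illustrate the kind of frame the rest of the walkthrough operates on. The column names A–D, the generating relationships, and the (much smaller) row count are all assumptions made purely for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the undisclosed CSV: four numeric columns,
# two of which (C and D) are deliberately built as near-linear
# combinations of the others so that correlations exist to discover.
rng = np.random.default_rng(0)
n = 1_000  # the article works with 1,000,000 rows; fewer here for speed
df = pd.DataFrame({
    "A": rng.normal(size=n),               # independent of the rest
    "B": rng.normal(size=n),
})
df["C"] = 2 * df["B"] + rng.normal(scale=0.1, size=n)
df["D"] = df["B"] + df["C"] + rng.normal(scale=0.1, size=n)
```

In practice this step would simply be `pd.read_csv(...)` on the acquired file.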

Step-2: Domain knowledge acquisition

In this particular use-case, the main goal was to identify the columns that belong to the numeric data type, so that patterns and correlations could be discovered among them.

For the average human being, it is impossible to come up with equations just by looking at the dataset, so we let the machine intelligently establish relationships among these columns.

Note: Creating subsets of the dataframe significantly reduces computation time and was done keeping in mind the time-space complexity of machine learning algorithms.
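A minimal sketch of this step, assuming pandas: `select_dtypes` isolates the numeric columns, and `sample` draws the smaller subset the note refers to. The column names here are illustrative:

```python
import pandas as pd

# Toy frame mixing a non-numeric identifier with numeric columns.
df = pd.DataFrame({
    "id": ["r1", "r2", "r3", "r4"],
    "B": [1.0, 2.0, 3.0, 4.0],
    "C": [2.1, 4.0, 5.9, 8.2],
})

# Keep only numeric columns, then sample a smaller subset to reduce
# the computation time of the downstream correlation/regression steps.
numeric_df = df.select_dtypes(include="number")
subset = numeric_df.sample(n=2, random_state=42)
```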

Step-3: Devising an algorithm for correlation analysis and building mathematical equations

Fig-2: our function called “equationBuilder”

A pseudo-algorithm-based explanation of the function:

Data Cleaning:

●       Takes a DataFrame (df) as input.

●       Drops any rows containing NaN (Not a Number) values from the DataFrame.

Correlation Calculation:

●       Computes the correlation matrix for the remaining data in the DataFrame (df).

●       Rounds the correlation matrix values to one decimal place.

Identifying Correlated Columns:

●       Iterates through the columns of the correlation matrix and identifies columns that have a non-zero correlation with each other.

●       Creates a dictionary (correlation_dict) where each column is a key, and the corresponding value is a list of columns that are correlated with it.

Linear Regression and Equation Building:

●       For each key-value pair in correlation_dict, it performs linear regression.

●       The dependent variable (y) is the column specified by the key, and the independent variables (X) are the columns specified in the corresponding value list.

●       Builds a linear regression model using scikit-learn's LinearRegression.

●       Constructs a linear equation based on the regression coefficients and intercept.

●       The coefficients determine whether the relationship between two variables is positive or negative.

The fundamental premise of linear regression is to model this relationship as a linear equation of the form:

Y = β0 + β1X1 + β2X2 + ...+ βnXn + ɛ

Here, 'Y' represents the dependent variable; 'X1,' 'X2,' ..., 'Xn' are the independent variables; 'β0' is the intercept; 'β1,' 'β2,' ..., 'βn' are the coefficients corresponding to each independent variable; and 'ɛ' denotes the error term. The objective is to estimate the coefficients 'β0,' 'β1,' 'β2,' ..., 'βn' that best capture the linear relationship between the variables.
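To see that scikit-learn's LinearRegression does estimate these coefficients, here is a small check on synthetic, noise-free data generated from known β values (the values 5, 2, and −3 are chosen arbitrarily for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Generate data exactly satisfying Y = 5 + 2*X1 - 3*X2 (no error term),
# so the fitted intercept and coefficients should recover beta.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 5 + 2 * X[:, 0] - 3 * X[:, 1]

model = LinearRegression().fit(X, y)
# model.intercept_ should be close to 5, model.coef_ close to [2, -3]
```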

Rule Generation:

●       Constructs rules for each equation, both in a technical format and a more readable format.

●       Stores the rules in a dictionary (temp_rules), where the keys are 'Technical_Rule' and 'Readable_Rule'.

●       The rules are then stored in result_dict with the dependent variable as the key.
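The pseudo-algorithm above can be sketched in code as follows. This is a minimal reconstruction, not the actual "equationBuilder" shown in Fig-2; details such as the coefficient formatting and the exact wording of the readable rule are assumptions:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def equation_builder(df: pd.DataFrame) -> dict:
    """Sketch of the pseudo-algorithm; assumes all columns are numeric."""
    # Data Cleaning: drop rows containing NaN values.
    df = df.dropna()

    # Correlation Calculation: correlation matrix rounded to one decimal.
    corr = df.corr().round(1)

    # Identifying Correlated Columns: map each column to every other
    # column it shares a non-zero (rounded) correlation with.
    correlation_dict = {}
    for col in corr.columns:
        related = [other for other in corr.columns
                   if other != col and corr.loc[col, other] != 0]
        if related:
            correlation_dict[col] = related

    # Linear Regression, Equation Building, and Rule Generation.
    result_dict = {}
    for target, features in correlation_dict.items():
        model = LinearRegression().fit(df[features], df[target])
        terms = " + ".join(f"({coef:.3f} * {name})"
                           for coef, name in zip(model.coef_, features))
        temp_rules = {
            "Technical_Rule": f"{target} = {model.intercept_:.3f} + {terms}",
            "Readable_Rule": f"{target} can be estimated from "
                             + ", ".join(features),
        }
        result_dict[target] = temp_rules
    return result_dict
```

Note that a column with zero (rounded) correlation to every other column never becomes a key in `correlation_dict`, so it drops out of the results entirely, consistent with the behaviour described for column "A" below.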

Fig-3: The output of the algorithm on the dataset

Note: It can be seen that column “A” does not appear anywhere in the equations because it shares zero correlation with the other columns, which renders it irrelevant to the equations.

Step-4: Analysing a relevant equation

Fig-4: equation for the variable ‘D’

 

The equation above could be read as:

Or in short:

To confirm the validity of the equation, we plug in values from the actual dataframe and arrange them according to the equation:

Fig-5: Sample input

The output of the corresponding dataframe row for the column ‘total_outstanding_amt’:

Fig-6: Output of the dataframe row

 

It can be seen that the output of the dataframe row in Fig-6 is approximately equal to the output of the sample input as seen in Fig-5.
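The validation step can be sketched like so: fit a model for one target column, plug a single row's values into the resulting equation, and compare against that row's actual value. The column names and data below are illustrative stand-ins for the values shown in Fig-5 and Fig-6:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame where D is an exact linear function of B and C, so the
# plugged-in equation output should match the actual row value.
df = pd.DataFrame({
    "B": [1.0, 2.0, 3.0, 4.0, 5.0],
    "C": [2.0, 4.1, 5.9, 8.0, 10.1],
})
df["D"] = 3 * df["B"] + 2 * df["C"] + 1

model = LinearRegression().fit(df[["B", "C"]], df["D"])

# Plug the first row's values into the fitted equation by hand.
row = df.iloc[0]
predicted = model.intercept_ + (model.coef_ * row[["B", "C"]]).sum()
# predicted should be approximately equal to row["D"]
```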

Conclusion

In essence, the fusion of mathematics and machine learning, as exemplified by Pyrack's methodology, provides a systematic and efficient way to distil meaningful mathematical insights from large datasets. By leveraging linear algebra and Python's Scikit-Learn library, the process involves data preparation, domain-specific knowledge acquisition, and the development of an algorithm for correlation analysis. The resulting equations, validated through real data, offer a deeper understanding of relationships between dataset columns. This approach not only automates the extraction of mathematical insights but also demonstrates practical applications. The systematic synergy of mathematics, machine learning, and programming emerges as a potent tool for uncovering hidden patterns and facilitating data-driven decision-making.

Scope

1. Enhanced Modelling Techniques: To address the limitations, future iterations of this methodology could explore more advanced modelling techniques, such as non-linear regression or machine learning algorithms capable of handling multiplicative and polynomial equations. Incorporating these techniques would broaden the scope, allowing for a more comprehensive analysis of complex relationships within datasets. For instance, considering scenarios like rate of interest affecting the final amount in financial datasets, polynomial relations and regression could be integrated to capture the intricacies of such relationships.

2. Expanded Domain Applicability: While the current focus is on a specific data type, expanding the methodology to accommodate diverse data types and industries would enhance its versatility. Tailoring the approach to different domains could open up opportunities for a wider range of applications, making it a more adaptable tool for various analytical challenges, including those involving polynomial relationships.

3. Integration of Feature Engineering: Incorporating feature engineering methods could contribute to a more nuanced understanding of the data. By creating new features or transforming existing ones, the methodology could better capture intricate relationships, improving its ability to uncover hidden patterns and dependencies. This is particularly relevant when dealing with datasets exhibiting polynomial trends, where feature engineering can enhance the model's capacity to discern and interpret complex patterns.

4. Validation on Non-Linear Datasets: To validate the scope, the methodology should be tested on datasets known for their non-linear relationships. This would provide insights into its performance and identify areas for improvement, ensuring its applicability across a spectrum of data types and structures, including those involving polynomial relations and non-linear dependencies.

AI
Machine Learning
Linear Regression
Mathematics
Maths
Coefficients