Classification of Fraudulent Transactions Using BigQuery ML

Written by Adrian Militaru (Data Engineer) and Sandor Marton (Data Engineer), in the September 2023 issue of Today Software Magazine.

Classification is one of the most used methods for developing solutions in Artificial Intelligence-based systems. Some of the most popular applications of these algorithms include object recognition in images, handwritten text transcription and letter recognition, medical diagnosis, spam detection, sentiment analysis in natural language processing, user or customer classification in a market, customer churn prediction, and banking transaction detection.

This article presents a suite of services from the Google Cloud Platform (GCP) for solving a binary classification problem: the detection of fraudulent transactions.

Depending on the number of possible outcomes, classification problems fall into two categories: binary classification and multi-class classification. In our case, since we want to assign transactions one of only two labels, fraudulent or non-fraudulent, the problem is a binary classification one.

Starting from an initial set M of bank transactions for which we do not know whether they are fraudulent, we want to predict, for each transaction, its membership in one of the two disjoint subsets (SM1 and SM2) of the initial set M: the subset of fraudulent transactions or the subset of regular operations.

Throughout this process, we will use BigQuery ML for training and evaluating the logistic regression model, GCP Dataflow for loading data into BigQuery, and Looker Studio for visualisations that describe the prediction results.

BigQuery ML is an extension of the Google BigQuery service that allows users to create and train Machine Learning models directly within the BigQuery environment, without transferring data to another platform or using a separate Machine Learning tool. We chose this service for its numerous benefits, including its ease of use, the fact that data does not need to be moved to another location, and the reduced development and implementation time.

Furthermore, given the flexibility of the BigQuery ML service, the Machine Learning side of the project we developed was not limited in any way, considering the list of models to choose from and how easily they can be parameterised.

Being an easily integrable service and an extension of BigQuery, BigQuery ML uses SQL, which makes data processing tasks (calculating new properties based on existing ones, removing redundant data, etc.) easy to perform. Additionally, we created, trained, and evaluated the model using SQL-like commands.

Google Cloud Dataflow is a powerful data processing service that works seamlessly with both batch and streaming data. It allows users to process and analyze large volumes of data in a scalable and efficient manner. Google Cloud Dataflow uses Apache Beam, enabling developers to write code to define data transformations and analytics. This service is ideal for tasks such as streaming processing, real-time data analysis, and ETL (Extract, Transform, Load).

With Google Cloud Dataflow, managing large volumes of data is much easier, thanks to its integration with other GCP services, such as BigQuery or Storage, to build advanced data analysis solutions and extract statistics.

Logistic regression is a statistical analysis technique and a Machine Learning algorithm used to predict the probability of an event occurring (such as classifying an object into one of two categories).
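Concretely, logistic regression passes a linear combination of the input features through the sigmoid function, which maps any real number into the interval (0, 1) and can therefore be read as a probability:

p(fraudulent | x1, ..., xn) = 1 / (1 + e^(-(b0 + b1*x1 + ... + bn*xn)))

where x1, ..., xn are the features of a transaction and b0, ..., bn are the coefficients learned during training. The model labels a transaction as fraudulent when this probability exceeds a chosen threshold (0.5 by default in BigQuery ML).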

Figure 1. Enumeration of the steps in the presented process. In the white cells, we have described the specific activity of each step, and in the gray cell beneath each step, we have specified the programming language (e.g., Scala) or the Google Cloud service used (Dataflow, BigQuery ML, or Looker Studio).

In Figure 1, we have visually described the sequence of steps we followed to complete this project.

Since we didn’t use a dataset provided by a bank, we generated our own dataset with a structure like that of a real bank transaction.

Subsequently, we divided this dataset into three disjoint subsets, which we used in the three different stages of the logistic regression model: training, validation, and, finally, testing.

Using a Dataflow job, we loaded the data into BigQuery, where we prepared it to be as relevant as possible for the classification model (we added new columns derived from those already generated and removed redundant information).

We used a dataset containing 300,000 bank transactions, which we divided into 80% for model training and 20% for validation.

We repeated experiments with different logistic regression models, using various parameters until we obtained satisfactory results.

Afterwards, we applied the selected model to the last dataset, the test dataset, which was the same size as the one used for training and validation: 300,000 transactions.

For this study, we generated synthetic data using an in-house script designed to mimic transactional data.

The synthetic data adheres to a predefined schema closely resembling real transaction records. The schema includes the following key fields:

  • transactional_id: a positive integer serving as a unique identifier for each transaction.

  • timestamp: millisecond-precise timestamps, starting from September 2023.

  • customer_id: randomly generated integers between 0 and 999 to represent unique customer identifiers.

  • transaction: a nested field containing the following subfields:

    1. city: an enumeration with possible values: Cluj, Oradea, and Timișoara, representing the transaction’s location.

    2. type: another enumeration with possible values: purchase, withdrawal, or transfer. There is a 1% probability of obtaining a null value for this field.

    3. amount: a bigdecimal number uniformly distributed between 0 and 100. There is a 5% probability of obtaining a null value for this field.

{
  "customer_id": 1112,
  "timestamp": "2023-09-16T16:37:42.958Z",
  "transaction": {
    "amount": 65.9657471694887,
    "city": "Cluj",
    "type": "transfer"
  },
  "transaction_id": 130904
},
{
  "customer_id": 1232,
  "timestamp": "2023-09-16T16:10:11.728Z",
  "transaction": {
    "amount": 76.3690012480496,
    "city": "Timisoara",
    "type": "transfer"
  },
  "transaction_id": 6180953
},
{
  "customer_id": 1245,
  "timestamp": "2023-09-17T15:26:41.710Z",
  "transaction": {
    "amount": null,
    "city": "Cluj",
    "type": "withdrawal"
  },
  "transaction_id": 9472075
}

In total, a dataset containing 300,000 synthetic transactions in JSON format was generated. These transactions serve as training data for our Machine Learning model, allowing it to learn indicative patterns of suspicious behaviour.

Additionally, we generated another set of 30,000 synthetic transactions that adhere to the same schema to evaluate the model’s performance. These transactions will be used to assess the model’s ability to identify suspicious transactions correctly.

Once we obtained the synthetic data, we stored the files in Google Cloud Storage and explored options for ingesting them into Google BigQuery. We chose to use Google Dataflow for its ease of use: Dataflow provides predefined templates, one of which is designed for data ingestion from Google Cloud Storage into BigQuery.

In this case, we specified the input file path, the target table, its schema, and a JavaScript file containing a transformation applied to the input data before writing it to BigQuery.
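The template expects the table schema as a JSON file; for readability, the target table we loaded into can be sketched with equivalent BigQuery DDL (the project, dataset, and table names below are illustrative, not those of the original project):

CREATE TABLE IF NOT EXISTS `my_project.fraud_dataset.raw_transactions` (
  transaction_id INT64,      -- unique identifier of the transaction
  timestamp      TIMESTAMP,  -- millisecond-precise moment of the transaction
  customer_id    INT64,      -- identifier of the customer
  transaction    STRUCT<
    city   STRING,           -- Cluj, Oradea or Timisoara
    type   STRING,           -- purchase, withdrawal or transfer; may be NULL
    amount NUMERIC           -- value between 0 and 100; may be NULL
  >
);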

We processed the data in three steps, during which we calculated new properties based on those generated in the synthetic dataset.

Using the transaction’s date and time, as well as the customer’s ID, we labelled a transaction as fraudulent in any of the four situations outlined in the following table:

1. The same customer made two or more transactions in less than one minute.

2. A customer had at least two invalid transactions associated with their account within 4 hours.

3. A customer made transactions of the same amount more than twice within a 4-hour interval.

4. The customer had more than 10 transactions within 4 hours.

Table 1. Cases in which a transaction was marked as fraudulent (created by the authors).

In this manner, we labelled over 13,000 out of the 300,000 transactions as fraudulent. Subsequently, we removed redundant properties from the dataset, such as the transaction ID, which does not help the Machine Learning algorithm classify transactions.
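As an illustration, the first rule can be expressed with a window function that looks at each customer’s previous transaction. The sketch below flags the later transaction of such a pair and uses the same illustrative table name as above; the other rules follow the same pattern, counting transactions or invalid fields over a 4-hour window, and the final label is the OR of the four rules:

SELECT
  t.*,
  -- TRUE when the same customer's previous transaction happened less than 60 seconds earlier
  COALESCE(
    TIMESTAMP_DIFF(
      t.timestamp,
      LAG(t.timestamp) OVER (PARTITION BY t.customer_id ORDER BY t.timestamp),
      SECOND
    ) < 60,
    FALSE
  ) AS suspicious_rule_1
FROM `my_project.fraud_dataset.raw_transactions` AS t;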

Using BigQuery ML, we created a model to which we provided 80% of the 300,000 transactions, which amounts to 240,000 transactions, constituting the training dataset.

The parameters we specified when creating the model include the model type (in our case, logistic regression), the column containing the labels representing the prediction outcome (in our table, the “suspicious” column), and the method for splitting the data into training and validation sets (we opted for a random split).

We experimented with various other parameters, but the results were either considerably worse or suspiciously good.

Although some other models showed better metrics, we did not choose them because they exhibited the phenomenon known in the field as overfitting: they learned the training data too well and were at risk of making weaker predictions on a new dataset.

AUTO_CLASS_WEIGHTS is an example of a parameter we abandoned. Initially, we included it in experiments to balance the number of fraudulent transactions with those that do not show any attempt at fraud (considering that only 13,000 out of the 300,000 transactions are fraudulent). However, we stopped using it when we observed that it had a negative impact on the considered metrics (precision, accuracy).

OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['suspicious'],
  data_split_method = 'RANDOM',
  data_split_eval_fraction = 0.20  -- hold out 20% of the data for validation
)
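In context, this OPTIONS clause is part of a CREATE MODEL statement along the following lines (the model and table names are illustrative ones, not those of the original project; the selected columns are the prepared features together with the “suspicious” label):

CREATE OR REPLACE MODEL `my_project.fraud_dataset.fraud_classifier`
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['suspicious'],
  data_split_method = 'RANDOM',
  data_split_eval_fraction = 0.20
) AS
SELECT *  -- the prepared feature columns and the 'suspicious' label
FROM `my_project.fraud_dataset.prepared_transactions`;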

In addition to the actual creation of the model (which is stored in BigQuery like any other familiar entity, such as a table or a view), BigQuery ML provides, as it does at the end of any command execution, a set of graphically represented statistics that describe the execution details and the execution timeline.

The execution details include the duration of each training iteration as well as the evolution of loss functions and learning rates across these iterations. The execution timeline illustrates the time required for each step of the model creation command: preprocessing, training, and evaluation.

Figure 2 and Figure 3. Execution details (per-iteration duration, loss, and learning rate) and execution timeline of the model creation command.
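The same per-iteration statistics can also be queried as a table with the ML.TRAINING_INFO function, which is convenient when comparing several experiments (the model name is the illustrative one used earlier):

SELECT
  iteration,
  loss,           -- training loss at the end of the iteration
  eval_loss,      -- loss measured on the evaluation split
  learning_rate,
  duration_ms     -- time spent in the iteration, in milliseconds
FROM ML.TRAINING_INFO(MODEL `my_project.fraud_dataset.fraud_classifier`)
ORDER BY iteration;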

Model evaluation is just as intuitive and straightforward as the training part, as it is performed through an SQL-like command.
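Concretely, the evaluation is a single query that passes the model and, optionally, a table of labelled data to the ML.EVALUATE function (the names are the illustrative ones used earlier):

SELECT
  precision,
  recall,
  accuracy,
  f1_score,
  log_loss,
  roc_auc
FROM ML.EVALUATE(
  MODEL `my_project.fraud_dataset.fraud_classifier`,
  (SELECT * FROM `my_project.fraud_dataset.test_transactions`)
);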

Evaluation results are presented in tabular format, with each metric detailed in a column. In our case, the results are very good, both on the validation dataset and the test dataset. The metrics we considered can be observed in Fig. 4.

Figure 4. Evaluation metrics of the model on the validation and test datasets.

Since the number of non-fraudulent transactions is much higher than that of fraudulent ones (in all datasets, both in the training and test sets), good accuracy doesn’t necessarily indicate good precision because classifying a large number of transactions as non-fraudulent significantly inflates the accuracy percentage. Therefore, in evaluating the model, we placed particular importance on precision, not just accuracy.
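To see why, consider a trivial model that labels every transaction as non-fraudulent: on a set with roughly 13,000 fraudulent transactions out of 300,000, it would already reach an accuracy of about (300,000 - 13,000) / 300,000 ≈ 95.7%, while never flagging a single fraudulent transaction.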

Precision is crucial in contexts where the costs of incorrect predictions are high, such as in the medical or financial domain. In our test dataset, the model achieves an accuracy of 94% and a precision of 68.5%.

For obtaining predictions, the model is easy to use, requiring only the selection of data with a query that uses the previously created model as a parameter for the ML.PREDICT function.
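A prediction query of the following form labels every transaction in the test table and returns, alongside the selected input columns, the predicted label and the per-class probabilities (the model and table names are the illustrative ones used earlier; predicted_suspicious is the column referenced later in Figure 6.1):

SELECT
  customer_id,
  predicted_suspicious,        -- the label assigned by the model
  predicted_suspicious_probs   -- probability of each of the two classes
FROM ML.PREDICT(
  MODEL `my_project.fraud_dataset.fraud_classifier`,
  (SELECT * FROM `my_project.fraud_dataset.test_transactions`)
);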

For visualising the results, we chose to use another service from the Google Cloud platform, Looker Studio.

Looker Studio is a powerful data analysis and visualisation tool that supports the rapid creation of reports, interactive dashboards, and easily understandable graphical visualisations.

For more complex scenarios, the platform offers teams the opportunity to collaborate on the same documents to discover trends, make more informed decisions, and gain a competitive advantage from their data, thereby contributing to the digital transformation of businesses.

In this case, we used Looker Studio directly from BigQuery, seamlessly integrating the two services.

Figure 6.1 and Figure 6.2. Actual versus predicted classification of the transactions in the test dataset.

Figure 6.1 and Figure 6.2 describe the algorithm’s performance on the test dataset. The table presents the actual classification of transactions (actual_suspicious) compared to the classification made by the model (predicted_suspicious).

Both labels can take two values: true when the transaction is classified as fraudulent and false when it is not labelled as fraudulent.

Analysing Figure 6.2, we can observe that the model correctly labels most of the transactions (given the large dark blue area in the first row and the large light blue area in the second row), but it tends to classify more transactions as fraudulent than in reality.

With further tuning, the model could achieve better results, but currently this is not a significant issue, as we prefer to have some legitimate transactions labelled as fraudulent rather than the other way around (the model incorrectly labelling fraudulent transactions as normal).

In conclusion, this article highlights the ease of integrating three services available on the Google Cloud platform through a concrete example: the classification of bank transactions.

Alongside seamless integration, we recommend Dataflow, BigQuery ML, and Looker Studio for many other reasons, such as the large number of templates users can choose from and the numerous configurations developers can apply to processes, data, and charts, so that, regardless of the project’s application domain, the quality of the results is not compromised.