Weekly Reflections

WEEK 1 REFLECTION (2/10/2026): AI AND ORIGINALITY

AI and Originality
- The introduction of Artificial Intelligence started with the release of ChatGBT in November 2022, following other AI model releases such as Google Gemini, Claude, etc. This changed the world especially for data analysts and researchers, being used for research and coding assistance. Questions speculate whether AI can be used to brainstorm and assist with original ideas, or whether it generates the ideas for users.
- I argue that AI isn’t a tool to use for generating original ideas; rather that it draws and mixed original and previous research. I also argue that any ideas are completely “original” nowadays, as most of these designs have been publicized before, even though individuals proceed to reinvent and use them in new contexts.
AGI
- Google AI defines Artificial General Intelligence (AGI) as a type of technology used for theoretical ways of understanding, learning and performing any intellectual tasks that a human can accomplish. Furthermore, it is programmed to be a research agent, helping to break through scientific research and other disciplines.
- I believe that it can be used as a research assistant to produce the best research qualities, but I also argue that it can be used incorrectly, where the model can be prompted to do the research for the user. It may not be completely accurate, but it can be used in an academic dishonest way of conducting research and analyses.

WEEK 2 REFLECTION (2/17/2026): AI AND RESEARCH PROJECT

AI for Knowledge Mining Project
- For the group project, me and my colleagues believe we can use AI in our project by helping turn our STATA data zip file to help us use it in RStudio. AI can help us let the data be read in RStudio.
AI Model Selections and Research Applications
- We are using ChatGPT and Google Gemini for my project. Some of our colleagues are new using RStudio, so they will ask it to help them if I run into any errors uploading the STATA zip file to RStudio. Additionally, I will help my colleagues with the RStudio and Google Gemini codes to extract our results.

WEEK 3 REFLECTION (2/24/2026): HUMAN KNOWLEDGE, INFORMATION, AND AI DISCREPANCIES/FAILURES

Human Knowledge vs Information
- Individuals find and build knowledge based on their experiences and society they grew up in. According to Google AI, key methods of human knowledge are based on logical reasoning, memory, and receiving information from authorities, books, or education.
- Knowledge is the capability of a person’s understanding regarding a topic, while information is data a person contains. Information can be transferred, such as an email with a list of documents; it is always kept. Knowledge on the other hand, requires the mental and logical capability to review and understand the data provided. Regarding the email example: a person may keep the email with the listed documents, but the person needs to look at the documents and understand them, which the person may forget.
AI Discrepancies and Failures
- AI can create fake and false data/information, such as fake legal cases, political bias, and financial losses due to false predictions. Specifically:
  - Hallucinations, specifically inventing fake policies or fake news
  - Embedded Bias and Propaganda: Unfair treatment of women’s rights in Iran and Afghanistan, but portraying as women-friendly nations.
  - Financial and Strategic Miscalculations: Zillow’s home-buying algorithm overestimated house prices, causing millions of losses and layoffs.
- One pattern that makes these discrepancies noticeable is AI-generated information and misinformation; tools used to mislead individuals and causing harm to them.

WEEK 4 REFLECTION (3/3/2026): MACHINE LEARNING PIPELINE ON CYBERCRIME ANALYSIS AND FINANCIAL LOSSES

In this week’s reflection, I created a machine learning pipeline from a cybersecurity incident database that focuses on cyber-attacks on different countries and their financial losses. I specifically focus on:
1. Number of cyber-crime complaint reports;
2. Total financial losses from those crimes for 2019 - 2024.

*AI Use Disclaimer: AI tools such as ChatGPT were used as an assistant tool during this reflection to help understand machine learning concepts and code fixture. All analysis, interpretation, and final decisions were reviewed and completed by the author.*

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.5.3

Warning: package 'ggplot2' was built under R version 4.5.3

Warning: package 'tidyr' was built under R version 4.5.3

Warning: package 'readr' was built under R version 4.5.3

Warning: package 'dplyr' was built under R version 4.5.3

Warning: package 'lubridate' was built under R version 4.5.2

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

data <- read.csv("C:/Users/akbar/Downloads/LossFromNetCrime.csv",
                 stringsAsFactors = FALSE)

# Convert all complaint/loss columns except country to numeric
num_cols <- names(data)[names(data) != "Country"]

data[num_cols] <- lapply(data[num_cols], function(x) as.numeric(gsub(",", "", x)))

str(data)

'data.frame':   117 obs. of  13 variables:
 $ Country         : chr  "PR" "PS" "PT" "PY" ...
 $ X2019_Complaints: num  655 1784 1119 1913 5503 ...
 $ X2019_Losses    : num  5929974 22483591 13870074 10967865 48101706 ...
 $ X2020_Complaints: num  1338 2890 2020 2992 7390 ...
 $ X2020_Losses    : num  7209755 25423219 12391290 13815152 81178182 ...
 $ X2021_Complaints: num  1785 3352 2102 3188 10164 ...
 $ X2021_Losses    : num  9.46e+06 4.89e+07 1.82e+07 2.67e+07 1.32e+08 ...
 $ X2022_Complaints: num  1594 3210 1918 3768 10042 ...
 $ X2022_Losses    : num  1.72e+07 5.78e+07 3.09e+07 4.01e+07 1.87e+08 ...
 $ X2023_Complaints: num  1817 3378 2178 3487 11034 ...
 $ X2023_Losses    : num  2.10e+07 6.93e+07 2.87e+07 3.36e+07 2.44e+08 ...
 $ X2024_Complaints: num  1974 3811 2209 2678 12071 ...
 $ X2024_Losses    : num  3.15e+07 6.60e+07 4.02e+07 4.52e+07 2.81e+08 ...

summary(data)

   Country          X2019_Complaints  X2019_Losses       X2020_Complaints
 Length:117         Min.   :   216   Min.   :2.498e+06   Min.   :   362  
 Class :character   1st Qu.:  1119   1st Qu.:1.179e+07   1st Qu.:  1937  
 Mode  :character   Median :  5156   Median :4.571e+07   Median :  8187  
                    Mean   : 24650   Mean   :2.366e+08   Mean   : 43249  
                    3rd Qu.: 16525   3rd Qu.:2.025e+08   3rd Qu.: 28232  
                    Max.   :449305   Max.   :3.303e+09   Max.   :796395  
  X2020_Losses       X2021_Complaints  X2021_Losses       X2022_Complaints
 Min.   :2.673e+06   Min.   :   391   Min.   :4.363e+06   Min.   :   458  
 1st Qu.:1.382e+07   1st Qu.:  2102   1st Qu.:2.286e+07   1st Qu.:  1918  
 Median :4.529e+07   Median :  9415   Median :1.000e+08   Median :  8819  
 Mean   :2.812e+08   Mean   : 45731   Mean   :4.713e+08   Mean   : 42984  
 3rd Qu.:2.125e+08   3rd Qu.: 30367   3rd Qu.:4.003e+08   3rd Qu.: 29642  
 Max.   :3.907e+09   Max.   :940125   Max.   :6.467e+09   Max.   :769205  
  X2022_Losses       X2023_Complaints  X2023_Losses       X2024_Complaints
 Min.   :6.648e+06   Min.   :   400   Min.   :8.907e+06   Min.   :   330  
 1st Qu.:3.350e+07   1st Qu.:  2280   1st Qu.:4.132e+07   1st Qu.:  2253  
 Median :1.276e+08   Median :  9527   Median :1.631e+08   Median :  9251  
 Mean   :7.196e+08   Mean   : 45505   Mean   :8.590e+08   Mean   : 50734  
 3rd Qu.:6.055e+08   3rd Qu.: 31255   3rd Qu.:6.421e+08   3rd Qu.: 31525  
 Max.   :1.030e+10   Max.   :876894   Max.   :1.192e+10   Max.   :946966  
  X2024_Losses      
 Min.   :9.793e+06  
 1st Qu.:4.788e+07  
 Median :1.876e+08  
 Mean   :1.014e+09  
 3rd Qu.:8.030e+08  
 Max.   :1.446e+10

# Keep only rows with non-missing 2024 losses
data <- data[!is.na(data$X2024_Losses), ]

set.seed(123)
n <- nrow(data)
train_index <- sample(1:n, size = floor(0.8 * n))

train <- data[train_index, ]
test  <- data[-train_index, ]

# Variables used in the model
model_vars <- c(
  "X2019_Complaints","X2019_Losses",
  "X2020_Complaints","X2020_Losses",
  "X2021_Complaints","X2021_Losses",
  "X2022_Complaints","X2022_Losses",
  "X2023_Complaints","X2023_Losses",
  "X2024_Losses"
)

# Keep only rows without missing values
train_model <- train[complete.cases(train[, model_vars]), ]
test_model  <- test[complete.cases(test[, model_vars]), ]

nrow(train_model)

[1] 93

nrow(test_model)

[1] 24

model <- lm(X2024_Losses ~ X2019_Complaints + X2019_Losses +
              X2020_Complaints + X2020_Losses +
              X2021_Complaints + X2021_Losses +
              X2022_Complaints + X2022_Losses +
              X2023_Complaints + X2023_Losses,
            data = train_model)

summary(model)


Call:
lm(formula = X2024_Losses ~ X2019_Complaints + X2019_Losses + 
    X2020_Complaints + X2020_Losses + X2021_Complaints + X2021_Losses + 
    X2022_Complaints + X2022_Losses + X2023_Complaints + X2023_Losses, 
    data = train_model)

Residuals:
       Min         1Q     Median         3Q        Max 
-321874006   -9915622   -1716146   21722915  343895159 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       2.187e+06  1.241e+07   0.176 0.860517    
X2019_Complaints  4.788e+02  2.592e+03   0.185 0.853928    
X2019_Losses     -2.692e-01  5.359e-01  -0.502 0.616803    
X2020_Complaints  1.089e+04  2.420e+03   4.499 2.23e-05 ***
X2020_Losses      1.352e+00  3.368e-01   4.014 0.000132 ***
X2021_Complaints  1.471e+03  1.525e+03   0.964 0.337771    
X2021_Losses      4.631e-01  1.826e-01   2.535 0.013130 *  
X2022_Complaints -5.310e+03  2.005e+03  -2.648 0.009701 ** 
X2022_Losses      9.396e-01  1.303e-01   7.209 2.52e-10 ***
X2023_Complaints -6.364e+03  1.465e+03  -4.344 3.98e-05 ***
X2023_Losses     -2.637e-01  1.343e-01  -1.964 0.052954 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 105300000 on 82 degrees of freedom
Multiple R-squared:  0.9978,    Adjusted R-squared:  0.9976 
F-statistic:  3795 on 10 and 82 DF,  p-value: < 2.2e-16

predictions <- predict(model, newdata = test_model)

rmse <- sqrt(mean((predictions - test_model$X2024_Losses)^2))
rmse

[1] 406916980

str(data)

'data.frame':   117 obs. of  13 variables:
 $ Country         : chr  "PR" "PS" "PT" "PY" ...
 $ X2019_Complaints: num  655 1784 1119 1913 5503 ...
 $ X2019_Losses    : num  5929974 22483591 13870074 10967865 48101706 ...
 $ X2020_Complaints: num  1338 2890 2020 2992 7390 ...
 $ X2020_Losses    : num  7209755 25423219 12391290 13815152 81178182 ...
 $ X2021_Complaints: num  1785 3352 2102 3188 10164 ...
 $ X2021_Losses    : num  9.46e+06 4.89e+07 1.82e+07 2.67e+07 1.32e+08 ...
 $ X2022_Complaints: num  1594 3210 1918 3768 10042 ...
 $ X2022_Losses    : num  1.72e+07 5.78e+07 3.09e+07 4.01e+07 1.87e+08 ...
 $ X2023_Complaints: num  1817 3378 2178 3487 11034 ...
 $ X2023_Losses    : num  2.10e+07 6.93e+07 2.87e+07 3.36e+07 2.44e+08 ...
 $ X2024_Complaints: num  1974 3811 2209 2678 12071 ...
 $ X2024_Losses    : num  3.15e+07 6.60e+07 4.02e+07 4.52e+07 2.81e+08 ...

colSums(is.na(data))

         Country X2019_Complaints     X2019_Losses X2020_Complaints 
               1                0                0                0 
    X2020_Losses X2021_Complaints     X2021_Losses X2022_Complaints 
               0                0                0                0 
    X2022_Losses X2023_Complaints     X2023_Losses X2024_Complaints 
               0                0                0                0 
    X2024_Losses 
               0

# Compare training vs test error
train_predictions <- predict(model, newdata = train_model)

train_rmse <- sqrt(mean((train_predictions - train_model$X2024_Losses)^2))
test_rmse <- sqrt(mean((predictions - test_model$X2024_Losses)^2))

train_rmse

[1] 98847983

test_rmse

[1] 406916980

# Looking at residuals to help with error analysis
plot(model$fitted.values, model$residuals,
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residual Plot")
abline(h = 0, lty = 2)

# Checking actual vs predicted plots
results <- data.frame(
  Actual = test_model$X2024_Losses,
  Predicted = predictions
)

head(results)

        Actual  Predicted
1     31545772   22829813
2     66002407   87898101
10   448223952  470210159
11   145551079  205914199
18 11161782401 9824492611
19   303148050  321448290

The dataset is first imported and cleaned by removing rows with missing outcome values and converting the complaint and loss variables to numeric format. Following after, since the dataset contains too much data I split it into training and testing sets, which led to a linear regression being trained on the training data. The predictions are generated for the test data, and model performance is reviewed using Root Mean Squared Error (RMSE).
For the error analysis, the RMSE reveals how the model behaved across observations through the comparison of training vs testing errors. The residual plot demonstrates whether the model prediction errors followed any over or under-predicted losses.
Overfitting focuses on the differences between learning knowledge and learning noise in the data. When it comes to overfitting, a model can capture random fluctuations in the training data instead of generalizing patterns.

WEEK 5 REFLECTION (3/10/2026): PREDICTION VS EXPLANATION, SIMPLE CAUSAL RELATIONSHIP DIAGRAM AND DIFFERENTIATING CAUSAL VS PREDICTIVE CLAIMS:

Prediction matters when there’s is a clear goal or forecasting something accurately. When a policymaker wants to predict how to reduce crime and poverty in a state, they outline their ideas and pass a policy or law in hopes of producing an effective outcome.
Explanation matters when something happens and therefore needs to be a reason or a statement supporting the outcome. When a law or policy has been passed, the policymaker gives his/her reasoning why it’s effective. Prediction is the foreshadowing of an event, whereas an explanation is the reasoning or conclusion of the prediction taking place.

For the diagram above, the co-founders are variables that could affect both social relationships and health. Such variables could be age, income/socioeconomic status, education, access to healthcare, lifestyle and mental health.
For distinguishing a causal claim from a predictive association, a causal claim would be just like the causal diagram above. A predictive association would only indicate that individuals with stronger relationship skills tend to have better health and longer lives, without social relationships being the reason.

WEEK 6 REFLECTION (3/17/2026): TEXT MINING, NLP, LLM, AND AI FOR RESEARCH GUIDE

Google AI Overview states that text-mining, often referred to as “distant reading,” excels at identifying broad patterns, trends, and structures across extensive datasets – such as numerous books or millions of social media posts – that are impossible for a person to read. While close reading focuses on interpreting individual texts to find detailed qualitative meaning, text-mining provides a data-focused perspective of textual patterns. The blind spots however, include context, irony, and the nuanced interpretation of low-probability words.
Natural Language Processing (NLP) and Large Language Models (LLM) can assist in research by analyzing a vast number of unstructured datasets and accelerate them through routine tasks. An example of this scenario is a data analyst having unorganized datasets (papers, documents, files), and using these models to organize, clean up its unnecessary parts and complete them at a faster rate.
Google AI Overview states NLP and LLM models having a couple of limitations, such as generating false information, having the inability to reason logically, and having a lack of real-time knowledge. Suggested solutions include clear prompt-engineering, specifically providing the models clear instructions through specific prompts. Another suggestion is through Retrieval-Augmented Generation (RAG), which helps providing external knowledge and hybrid approaches, combining traditional NLP with LLM together.

AI For Research Guide

Note: My ideas for this AI Research Guide were informed in part by Harvard’s Library Artificial Intelligence for Research and Scholarship guide.

AI can help researchers discover literature, summarize key points and trends.
It can break down large pieces of information into smaller portions.
AI can assist with analyzing, clarifying and editing data and content.
AI-generated information should be always verified by multiple reliable sources.
It is encouraged to use AI for literature research when looking for new resources.

Citation References:

“Research Guides: Artificial Intelligence for Research and Scholarship: AI for Research.” 2026. AI for Research - Artificial Intelligence for Research and Scholarship - Research Guides at Harvard Library. Accessed March 4, 2026. https://guides.library.harvard.edu/c.php?g=1330621&p=10034534

WEEK 7 REFLECTION (3/24/2026): LLM FAILURES, AI IN THE FUTURE

According to Google AI Overview, LLMs exceed in text mining, summarizing information, and coding. However, they lack in high accuracy tasks, logical reasoning, updated knowledge, and have a limited understanding of the real world. When I first gave ChatGPT a prompt for a project I was working on, it understood the general requirements, but the code it generated was incorrect in the first couple of attempts. After I provided my working directory and database structure, the model was able to produce the correct code and assist me more effectively.
With my pursuit of becoming a data analyst/scientist in the next five years, generative AI can help me in the future with coding assistance, breaking down big portions of information into smaller bits, and editing data. I believe that AI isn’t and shouldn’t take over the data industry, since data-review and verification needs to be hand-operated. AI can and will be helpful for coding assistance and visualization, but the originality, creativity and ideas need to remain human-driven.

WEEK 8 REFLECTION (3/31/2026): RAG CRITICISMS AND IMPROVEMENTS

According to Google AI Overview, RAG (Retrieval-Augmented Generation) is an AI framework that LLM outputs by referencing internal documents or the internet before generating a response, which matters for research since it stays up-to-date, reduces hallucinations, specializes data, information, and offers cited and verified facts.
Grounding generation in retrieved documents change the trustworthiness of AI output through verifying evidences, reducing misinformation, being contextually accurate, and having AI easily explaining it with increased confidence in the output.
Some criticisms of RAG include inaccuracies and unreliability, as well as delivering outdated or incomplete information.
In order to improve generation is through advanced retrieval techniques, data preparation & chunking.

WEEK 9 REFLECTION (4/7/2026): KNOWLEDGE GRAPHS, WITH ITS NODES AND RELATIONSHIPS

Google Gemini states that a knowledge graph is a structured representation of information that models as a network of connected entities and relationships.
It differs from a traditional database and vector store because its data structure is made of of nodes (entities) and edges (relations), whereas traditional (relational) databases contains rows and columns, and vector store contain high-dimensional numerical arrays.
The reason why relationships between entities for knowledge mining are valuable is due to multiple factors: contextual disambiguation, inference and reasoning, pattern and path discovery, and enhancing AI accuracy, which is done through RAG.
In terms of a social data analyst perspective, three nodes could be neighborhoods, ZIP codes and specific metropolitan areas, and three relationships could be neighborhoods being located within ZIP codes, which is part of a metropolitan zone. The reason why this structure helps answer question a flat dataset cannot is due to flat datasets being built for direct attributes, whereas knowledge graphs are built for indirect paths.

WEEK 10 REFLECTION (4/14/2026): AI FOR SCIENCE AND AGENTIC AI SYSTEMS

ChatGPT defines AI for Science as the utilization of Artificial Intelligence to help with making scientific discoveries, where it helps scientists with data analysis, organization, cleaning, and solving complex model systems.
Agentic AI systems are designed to be efficient research assistants compared to single-prompt interactions., where it can manage breaking tasks into smaller parts, analyze outcomes, revise strategies and proceeds to work through the problems with stages.