Sign up here
- 3 Days
- London
- Stata
Overview
This intensive three-day workshop brings together two core pillars of contemporary research: Large Language Models (LLMs) for unstructured text analysis and feature engineering, and Machine Learning & AI modelling in Stata and Python for predictive analytics.
Participants will learn to design full research pipelines: starting from messy raw text, structuring it into analyzable features, and applying supervised and unsupervised ML models to generate predictions, improve data insights, and produce reliable classifications. The course blends theory, methodological insights, and hands-on coding with Stata/Python and LLMs. Case studies across the social sciences, economics, health, and humanities illustrate practical relevance.
The course is ideal for PhD students, early-career researchers, data scientists, and applied professionals interested in text-driven machine learning applications.
Key Skills Acquired
By the end of the course, students will understand:
- How to clean, preprocess, and structure raw text data, turning it into feature sets suitable for machine learning
- How to train, validate, and evaluate machine learning models on text-derived features, balancing predictive accuracy and interpretability
- How to integrate advanced neural models into their text analytics workflows, critically evaluate applications across domains, and build responsible, reproducible pipelines
Learning Outcomes
- Master the entire pipeline from raw text to validated ML model
- Gain hands-on experience with Python LLM/NLP and ML libraries
- Understand the trade-offs between traditional ML and neural AI models
- Be equipped to apply these techniques in their own domain
- Develop critical awareness of ethical challenges, data privacy, and responsible AI practices
Course Structure
The workshop is organised across three progressive days:
- Day 1 introduces the foundations of LLM-based text analytics, covering text preprocessing, natural language processing (NLP) techniques, and feature extraction from raw data
- Day 2 focuses on machine learning fundamentals for text-based research, including model construction, validation techniques, classical algorithms, and dimensionality reduction
- Day 3 explores advanced LLMs and AI integration through neural networks and transformer-based models, followed by domain-specific applications, ethical considerations, and a hands-on capstone project
Across all sessions, participants work directly in Stata and Python to develop reproducible text-analytic workflows and gain practical algorithmic experience.
Agenda
Day 1
Lecture & Discussion:
- The role of unstructured data in modern research (80% of today's data is unstructured)
- Sources of textual data: research articles, policy reports, legal texts, medical records, social media feeds, and OCR-scanned documents
- Conceptual challenges: ambiguity, context, semantics, scale
- Ethical issues: bias, misinformation risks, transparency, and data privacy
Practical Exercise
- Exploring raw textual datasets (academic abstracts, news articles)
- Identifying challenges of unstructured inputs
Lecture
- Introduction to LLMs and Natural Language Processing (NLP)
- Core preprocessing steps: tokenization, stemming vs lemmatization, stopword removal
- Named Entity Recognition (NER): extracting people, organisations, places
- Sentiment and opinion mining
- Feature engineering from text: bag of words, TF-IDF, n-grams
- Embeddings: Word2Vec, GloVe, BERT, and OpenAI embeddings, moving from surface forms to semantic meaning
Hands-on in Stata/Python
- Using spaCy and NLTK for preprocessing
- Running NER and sentiment analysis
- Generating feature matrices with TF-IDF and embeddings
- Comparing results across methods
Day 2
Lecture
- Introduction to supervised ML: regression vs. classification tasks
- The bias-variance trade-off and overfitting in text models
- Train/test splits, validation sets, k-fold cross-validation, bootstrap methods
- Metrics for model performance: accuracy, precision, recall, F1, ROC curves, AUC
- Beyond accuracy: interpretability and fairness
Hands-on in Python
- Building a baseline classifier using text features
- Comparing train/test errors
- Computing precision, recall, F1, and plotting ROC curves
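A minimal sketch of this baseline exercise, assuming scikit-learn; the labelled sentences are invented toy data (1 = positive, 0 = negative), repeated so the stratified split has enough rows per class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Tiny invented dataset: 1 = positive sentiment, 0 = negative
texts = [
    "great results and a clear improvement",
    "excellent model performance overall",
    "the findings are promising and robust",
    "poor accuracy and unstable training",
    "disappointing results with many errors",
    "the model fails on most examples",
] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0, stratify=labels
)

clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)

print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))
print("f1:", f1_score(y_test, pred))
```

On real data the same skeleton extends naturally to ROC curves via `sklearn.metrics.RocCurveDisplay`.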
Lecture
- Regularisation for high-dimensional text data: Lasso, Ridge, Elastic Net
- Naive Bayes for text classification (fast, interpretable, effective)
- K-Nearest Neighbours (KNN) and its limits in high-dimensional feature spaces
- Logistic regression as a fundamental classifier
- Introduction to dimensionality reduction for text: PCA and topic modeling (LDA)
Hands-on in Stata/Python
- Building multiple classifiers (Naive Bayes, Logistic Regression, KNN)
- Using cross-validation to tune hyperparameters
- Comparing models using validation metrics
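The hyperparameter-tuning step could look like the following sketch, assuming scikit-learn: a TF-IDF + Naive Bayes pipeline whose smoothing parameter `alpha` is tuned by 3-fold cross-validation. The corpus and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy corpus, repeated so each CV fold contains both classes
texts = [
    "strong evidence supports the hypothesis",
    "results confirm the predicted effect",
    "clear improvement over the baseline",
    "no significant effect was detected",
    "the experiment failed to replicate",
    "results contradict earlier findings",
] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

# Chain feature extraction and classifier so CV refits both per fold
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
param_grid = {"nb__alpha": [0.1, 0.5, 1.0]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Replacing `MultinomialNB` with `LogisticRegression` or `KNeighborsClassifier` in the same pipeline is how the session's model comparison proceeds.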
Day 3
Lecture:
- Neural networks for text: from simple feedforward architectures to more advanced models
- LLMs and the basics of transformers: BERT, GPT, and modern embeddings
- Fine-tuning vs using pre-trained models
- Challenges with high-dimensional data, training stability, and computational costs
Hands-on in Stata/Python
- Building a simple feedforward neural network on text data
- Using Hugging Face to fine-tune a small transformer for text classification
- Comparing traditional ML vs. deep learning results
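A minimal version of the feedforward-network exercise, sketched with scikit-learn's `MLPClassifier` (the session may equally use PyTorch or Keras; transformer fine-tuning with Hugging Face additionally requires downloading a pre-trained checkpoint, so it is omitted here). Data are invented toy sentences.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Invented toy sentences, repeated for a workable sample size
texts = [
    "the treatment group improved significantly",
    "patients responded well to the therapy",
    "outcomes were better than expected",
    "the condition worsened during follow-up",
    "severe side effects were reported",
    "no recovery was observed in the cohort",
] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

X = TfidfVectorizer().fit_transform(texts)

# A single hidden layer of 16 units on top of sparse TF-IDF features
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
mlp.fit(X, labels)

print("training accuracy:", mlp.score(X, labels))
```

Comparing this network's held-out performance against the Day 2 classical models is the stated goal of the session.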
Case Studies by Domain
- Social sciences: topic modeling of policy documents for comparative analysis
- Economics/finance: predicting stock market signals from news and financial reports
- Medicine: mining patient narratives for diagnostic insights
- Humanities: automated classification of historical texts and literary themes
Discussion
- Responsible AI in practice: bias in text models, misinformation, transparency, reproducibility
- Workflow automation with Python + APIs (OpenAI, Hugging Face)
- Building scalable pipelines
Capstone Hands-on Project
- Participants design and implement a mini-project, integrating the full workflow: text collection, preprocessing, feature engineering, model training, validation, and interpretation
- Results are presented in small groups with peer feedback
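The capstone workflow can be condensed into a single scikit-learn `Pipeline` evaluated by cross-validation, as in this hedged sketch (corpus and labels invented; a real project would substitute collected text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Stand-in for collected project text; repeated for CV-sized classes
texts = [
    "the policy reduced emissions in urban areas",
    "funding increased access to public services",
    "the reform improved school attendance",
    "the measure had no effect on unemployment",
    "the subsidy failed to reach rural households",
    "the program was discontinued after criticism",
] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

# Preprocessing/feature engineering and model training in one object,
# so validation refits the whole workflow on each fold
pipe = Pipeline([
    ("features", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, texts, labels, cv=5, scoring="f1")
print("mean F1 across folds:", round(scores.mean(), 3))
```

Packaging the workflow this way keeps the capstone reproducible: one object captures every step from raw text to validated predictions.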
Prerequisites
- No prior knowledge of LLMs or Python required
- Basic knowledge of regression and statistics
- Working knowledge of Stata