Sign up here
- 3 Days
- London
- Stata
Overview
This intensive three-day workshop brings together two core pillars of contemporary research: Large Language Models (LLMs) for unstructured text analysis and feature engineering, and Machine Learning & AI modelling in Stata and Python for predictive analytics.
Participants will learn to design full research pipelines: starting from messy raw text, structuring it into analyzable features, and applying supervised and unsupervised ML models to generate predictions, improve data insights, and produce reliable classifications. The course blends theory, methodological insights, and hands-on coding with Stata/Python and LLMs. Case studies across the social sciences, economics, health, and humanities illustrate practical relevance.
The course is ideal for PhD students, early-career researchers, data scientists, and applied professionals interested in text-driven machine learning applications.
Key Skills Acquired
By the end of the course, students will understand:
- How to clean, preprocess, and structure raw text data, turning it into feature sets suitable for machine learning
- How to train, validate, and evaluate machine learning models on text-derived features, balancing predictive accuracy and interpretability
- How to integrate advanced neural models into their text analytics workflows, critically evaluate applications across domains, and build responsible, reproducible pipelines
Learning Outcomes
- Master the entire pipeline from raw text to validated ML model
- Gain hands-on experience with Python LLM/NLP and ML libraries
- Understand the trade-offs between traditional ML and neural AI models
- Be equipped to apply these techniques in their own domain
- Develop critical awareness of ethical challenges, data privacy, and responsible AI practices
Course Structure
The workshop is organised across three progressive days:
- Day 1 introduces the foundations of LLM-based text analytics, covering text preprocessing, natural language processing (NLP) techniques, and feature extraction from raw data
- Day 2 focuses on machine learning fundamentals for text-based research, including model construction, validation techniques, classical algorithms, and dimensionality reduction
- Day 3 explores advanced LLMs and AI integration through neural networks and transformer-based models, followed by domain-specific applications, ethical considerations, and a hands-on capstone project
Across all sessions, participants work directly in Stata and Python to develop reproducible text-analytic workflows and gain practical algorithmic experience.
Agenda
Day 1
Lecture & Discussion:
- The role of unstructured data in modern research (80% of today's data is unstructured)
- Sources of textual data: research articles, policy reports, legal texts, medical records, social media feeds, and OCR-scanned documents
- Conceptual challenges: ambiguity, context, semantics, scale
- Ethical issues: bias, misinformation risks, transparency, and data privacy
Practical Exercise
- Exploring raw textual datasets (academic abstracts, news articles)
- Identifying challenges of unstructured inputs
Lecture
- Introduction to LLMs and Natural Language Processing (NLP)
- Core preprocessing steps: tokenization, stemming vs lemmatization, stopword removal
- Named Entity Recognition (NER): extracting people, organisations, places
- Sentiment and opinion mining
- Feature engineering from text: bag of words, TF-IDF, n-grams
- Embeddings: Word2Vec, GloVe, BERT, and OpenAI embeddings, moving from surface forms to semantic meaning
Hands-on in Stata/Python
- Using spaCy and NLTK for preprocessing
- Running NER and sentiment analysis
- Generating feature matrices with TF-IDF and embeddings
- Comparing results across methods
Day 2
Lecture
- Introduction to supervised ML: regression vs. classification tasks
- The bias-variance trade-off and overfitting in text models
- Train/test splits, validation sets, k-fold cross-validation, bootstrap methods
- Metrics for model performance: accuracy, precision, recall, F1, ROC curves, AUC
- Beyond accuracy: interpretability and fairness
Hands-on in Python
- Building a baseline classifier using text features
- Comparing train/test errors
- Computing precision, recall, F1, and plotting ROC curves
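A minimal sketch of this baseline exercise, assuming scikit-learn; the labelled sentences are invented toy data (1 = positive, 0 = negative), repeated so the stratified split has enough rows per class.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Tiny invented dataset: 1 = positive sentiment, 0 = negative
texts = [
    "great results and a clear improvement",
    "excellent model performance overall",
    "the findings are promising and robust",
    "poor accuracy and unstable training",
    "disappointing results with many errors",
    "the model fails on most examples",
] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0, stratify=labels
)

clf = LogisticRegression().fit(X_train, y_train)
pred = clf.predict(X_test)

print("precision:", precision_score(y_test, pred))
print("recall:", recall_score(y_test, pred))
print("f1:", f1_score(y_test, pred))
```

On real data the same skeleton extends naturally to ROC curves via `sklearn.metrics.RocCurveDisplay`.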
Lecture
- Regularisation for high-dimensional text data: Lasso, Ridge, Elastic Net
- Naive Bayes for text classification (fast, interpretable, effective)
- K-Nearest Neighbours (KNN) and its limits in high-dimensional feature spaces
- Logistic regression as a fundamental classifier
- Introduction to dimensionality reduction for text: PCA and topic modeling (LDA)
Hands-on in Stata/Python
- Building multiple classifiers (Naive Bayes, Logistic Regression, KNN)
- Using cross-validation to tune hyperparameters
- Comparing models using validation metrics
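The hyperparameter-tuning step could look like the following sketch, assuming scikit-learn: a TF-IDF + Naive Bayes pipeline whose smoothing parameter `alpha` is tuned by 3-fold cross-validation. The corpus and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy corpus, repeated so each CV fold contains both classes
texts = [
    "strong evidence supports the hypothesis",
    "results confirm the predicted effect",
    "clear improvement over the baseline",
    "no significant effect was detected",
    "the experiment failed to replicate",
    "results contradict earlier findings",
] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

# Chain feature extraction and classifier so CV refits both per fold
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
param_grid = {"nb__alpha": [0.1, 0.5, 1.0]}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="f1")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Replacing `MultinomialNB` with `LogisticRegression` or `KNeighborsClassifier` in the same pipeline is how the session's model comparison proceeds.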
Day 3
Lecture:
- Neural networks for text: from simple feedforward architectures to more advanced models
- LLMs and the basics of transformers: BERT, GPT, and modern embeddings
- Fine-tuning vs using pre-trained models
- Challenges with high-dimensional data, training stability, and computational costs
Hands-on in Stata/Python
- Building a simple feedforward neural network on text data
- Using Hugging Face to fine-tune a small transformer for text classification
- Comparing traditional ML vs. deep learning results
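A minimal version of the feedforward-network exercise, sketched with scikit-learn's `MLPClassifier` (the session may equally use PyTorch or Keras; transformer fine-tuning with Hugging Face additionally requires downloading a pre-trained checkpoint, so it is omitted here). Data are invented toy sentences.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Invented toy sentences, repeated for a workable sample size
texts = [
    "the treatment group improved significantly",
    "patients responded well to the therapy",
    "outcomes were better than expected",
    "the condition worsened during follow-up",
    "severe side effects were reported",
    "no recovery was observed in the cohort",
] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

X = TfidfVectorizer().fit_transform(texts)

# A single hidden layer of 16 units on top of sparse TF-IDF features
mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0)
mlp.fit(X, labels)

print("training accuracy:", mlp.score(X, labels))
```

Comparing this network's held-out performance against the Day 2 classical models is the stated goal of the session.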
Case Studies by Domain
- Social sciences: topic modeling of policy documents for comparative analysis
- Economics/finance: predicting stock market signals from news and financial reports
- Medicine: mining patient narratives for diagnostic insights
- Humanities: automated classification of historical texts and literary themes
Discussion
- Responsible AI in practice: bias in text models, misinformation, transparency, reproducibility
- Workflow automation with Python + APIs (OpenAI, Hugging Face)
- Building scalable pipelines
Capstone Hands-on Project
- Participants design and implement a mini-project, integrating the full workflow: text collection, preprocessing, feature engineering, model training, validation, and interpretation
- Results are presented in small groups with peer feedback
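The capstone workflow can be condensed into a single scikit-learn `Pipeline` evaluated by cross-validation, as in this hedged sketch (corpus and labels invented; a real project would substitute collected text):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Stand-in for collected project text; repeated for CV-sized classes
texts = [
    "the policy reduced emissions in urban areas",
    "funding increased access to public services",
    "the reform improved school attendance",
    "the measure had no effect on unemployment",
    "the subsidy failed to reach rural households",
    "the program was discontinued after criticism",
] * 5
labels = [1, 1, 1, 0, 0, 0] * 5

# Preprocessing/feature engineering and model training in one object,
# so validation refits the whole workflow on each fold
pipe = Pipeline([
    ("features", TfidfVectorizer(ngram_range=(1, 2))),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, texts, labels, cv=5, scoring="f1")
print("mean F1 across folds:", round(scores.mean(), 3))
```

Packaging the workflow this way keeps the capstone reproducible: one object captures every step from raw text to validated predictions.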
Prerequisites
- No prior knowledge of LLMs or Python required
- Basic knowledge of regression and statistics
- Working knowledge of Stata