Psychedelic Data Science 🍄

🎄🎁 Advent of Open Source – Day 18/24: A fun project analyzing vocabulary richness in psychedelic trip reports.

Last updated on 2024-12-29 | 2 min read | Technology, Open-Source, Advent

📝 Note: I'm running out of awesome projects and want to save the best for last, so today is just something fun.

While preparing for this advent calendar, I browsed through my 382 GitHub repositories and rediscovered this time capsule from 8 years ago. It explores the language of psychedelic experiences through data science - with an interesting hypothesis: do psychedelic experiences generate richer vocabulary compared to other substance reports?

📖 Origin Story

Back in 2015, when I was still “young” (before back pain and two-day hangovers became a thing), I was learning data science and natural language processing. I had an amusing hypothesis: surely people describing their psychedelic experiences would use richer vocabulary than those writing about stimulants - I mean, who hasn’t pondered the linguistic complexity of “everything is connected” versus “I cleaned my entire apartment at 4 AM”? Erowid.org, with its thousands of detailed first-person narratives across different substances, provided the perfect dataset to test this theory.

🔧 Technical Highlights

- Natural language processing of experience reports
- Vocabulary richness analysis across substance categories
- TF-IDF vectorization for text analysis
- Support Vector Machine classification of experience types
- Web scraping with BeautifulSoup
- K-means clustering for discovering common themes
- Word cloud generation to visualize vocabulary differences
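
To make "vocabulary richness" a bit more concrete, here is a minimal sketch of one common measure, the type-token ratio (unique words divided by total words), averaged per substance category. This is an illustration and not the code from the original project: the function names, the simple regex tokenizer, and the two toy reports are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Return unique words divided by total words (0.0 for empty text)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def mean_richness_by_category(reports):
    """Average the type-token ratio over (category, text) pairs."""
    scores = defaultdict(list)
    for category, text in reports:
        scores[category].append(type_token_ratio(text))
    return {cat: sum(vals) / len(vals) for cat, vals in scores.items()}


# Hypothetical toy reports, not taken from the real dataset.
reports = [
    ("psychedelic", "Everything is connected and everything breathes"),
    ("stimulant", "I cleaned and cleaned and cleaned the apartment"),
]
richness = mean_richness_by_category(reports)
print(richness)
```

The plain type-token ratio is easy to compute but drops as texts get longer, so it is only a fair comparison between reports of similar length.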

📊 Impact

- A learning project that helped me understand:
  - How different experiences shape language use
  - Processing subjective experience narratives
  - Document classification techniques
  - The challenges of quantifying vocabulary richness
- 10 GitHub stars

🎯 Challenges and Solutions

- Implementing respectful web scraping
- Controlling for report length and education level
- Creating meaningful metrics for vocabulary richness
- Visualizing linguistic patterns across categories
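
Controlling for report length matters because a naive unique-words-over-total-words ratio shrinks as a text grows. One way to reduce that effect is a moving-average type-token ratio, which scores fixed-size windows and averages them. The sketch below is an assumption about how this could be done, not the approach the project actually used; the function names and the default window size are hypothetical, and it does nothing about education level.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def moving_average_ttr(text, window=50):
    """Average the type-token ratio over every sliding window of tokens.

    Texts shorter than the window fall back to the plain ratio.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


# A short and a long toy text built from the same four words.
short_text = "a b c d"
long_text = "a b c d " * 10
print(moving_average_ttr(short_text, window=4))
print(moving_average_ttr(long_text, window=4))
```

Both toy texts get the same score here, while the plain ratio would rate the long one far lower simply because it repeats itself over more words.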

💡 Lessons Learned

- Putting things on GitHub makes it less likely that forgotten projects get lost
- Early data science projects often reveal our initial fascinations
- Web scraping requires both technical skill and ethical consideration
- Text analysis tools have evolved dramatically since 2015
- Sometimes the most interesting projects are the ones you almost forgot about

Want to explore this intersection of psychedelics, language, and data science? Check out the project on GitHub!

#OpenSource #Python #DataScience #NLP #MachineLearning

Bas Nijholt

Staff Engineer
