Psychedelic Data Science 🍄

πŸŽ„πŸŽ Advent of Open Source – Day 18/24: A fun project analyzing vocabulary richness in psychedelic trip reports.

(See my intro post)

πŸ“ Note: Running out of awesome projects and I want to save the best for last, so today just something fun.

While preparing for this advent calendar, I browsed through my 382 GitHub repositories and rediscovered this time capsule from 8 years ago. It explores the language of psychedelic experiences through data science, starting from an interesting hypothesis: do reports of psychedelic experiences use a richer vocabulary than reports about other substances?

📖 Origin Story

Back in 2015, when I was still “young” (before back pain and two-day hangovers became a thing), I was learning data science and natural language processing. I had an amusing hypothesis: surely people describing their psychedelic experiences would use richer vocabulary than those writing about stimulants - I mean, who hasn’t pondered the linguistic complexity of “everything is connected” versus “I cleaned my entire apartment at 4 AM”? Erowid.org, with its thousands of detailed first-person narratives across different substances, provided the perfect dataset to test this theory.

🔧 Technical Highlights

  • Natural language processing of experience reports
  • Vocabulary richness analysis across substance categories
  • TF-IDF vectorization for text analysis
  • Support Vector Machine classification of experience types
  • Web scraping with BeautifulSoup
  • K-means clustering for discovering common themes
  • Word cloud generation to visualize vocabulary differences
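To make the TF-IDF step above concrete, here is a minimal sketch in plain Python. It is not the project's actual code: the function names `tokenize` and `tf_idf` are made up for illustration, the three "reports" are invented toy sentences, and the weighting is the simplest textbook variant (relative term frequency times `log(N / document frequency)`). In practice a library class such as scikit-learn's `TfidfVectorizer` would do this work, with somewhat different smoothing and normalization.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Weight = (term count / document length) * log(N / document frequency).
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented toy snippets (an assumption for illustration), not real report text.
reports = [
    "the walls were breathing and the colors were singing",
    "i cleaned the apartment and then i cleaned the car",
    "the music and the colors melted together",
]

for weights in tf_idf(reports):
    # Show the three highest-weighted terms of each document.
    top = sorted(weights, key=weights.get, reverse=True)[:3]
    print(top)
```

Words that occur in every document (here "the" and "and") get a weight of zero, which is why TF-IDF vectors are a reasonable input for the classification and clustering steps listed above: they emphasize what is distinctive about each report.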

📊 Impact

  • A learning project that helped me understand:
    • How different experiences shape language use
    • Processing subjective experience narratives
    • Document classification techniques
    • The challenges of quantifying vocabulary richness
  • 10 GitHub stars

🎯 Challenges and Solutions

  • Implementing respectful web scraping
  • Controlling for report length and education level
  • Creating meaningful metrics for vocabulary richness
  • Visualizing linguistic patterns across categories
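The report-length problem in the list above can be illustrated with a short sketch. The plain type-token ratio (unique words divided by total words) falls as a text gets longer, so long reports look "poorer" than short ones. One common workaround is the moving-average type-token ratio, which averages the ratio over fixed-size sliding windows. This is an assumption about how one might handle it, not necessarily the metric the project used; the function names and the window size are hypothetical, and the sketch does nothing about education level.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(tokens):
    """Unique words divided by total words; shrinks as texts get longer."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def moving_average_ttr(tokens, window=50):
    """Average the type-token ratio over every sliding window of fixed size.

    Texts no longer than the window fall back to the plain ratio.
    """
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[start:start + window])
        for start in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


# Toy input: the same sentence once, and repeated 20 times.
short = tokenize("the cat sat on the mat")
long_text = short * 20

print(type_token_ratio(short), type_token_ratio(long_text))
print(moving_average_ttr(short, window=6), moving_average_ttr(long_text, window=6))
```

In this toy case the plain ratio drops sharply for the repeated text even though the vocabulary is identical, while the windowed version stays essentially the same for both, which is the property you want when comparing reports of very different lengths.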

💡 Lessons Learned

  1. Putting projects on GitHub makes it less likely that forgotten work gets lost
  2. Early data science projects often reveal our initial fascinations
  3. Web scraping requires both technical skill and ethical consideration
  4. Text analysis tools have evolved dramatically since 2015
  5. Sometimes the most interesting projects are the ones you almost forgot about

Want to explore this intersection of psychedelics, language, and data science? Check out the project on GitHub!

#OpenSource #Python #DataScience #NLP #MachineLearning

Bas Nijholt
Staff Engineer
