Social Media Archive Project

Preserving and intelligently organizing media content using data scraping and machine learning


Project Overview

Social media platforms constantly evolve, and unfortunately, that often means content disappears—whether through policy changes, account suspensions, or platform restructuring. This project was born from the need to reliably preserve media content I had engaged with and wanted to reference later across multiple platforms. It started with Twitter, where content takedowns are common, but quickly expanded to include numerous other websites and social media sources.

The project evolved into a comprehensive media management system with three key components: an automated data scraping solution to extract and download media, a machine learning-powered classification system to intelligently categorize the downloaded content, and a custom media browser for efficient navigation of the resulting archive.

What began as a simple archiving tool transformed into an exploration of web automation, machine learning implementation, and graphical user interface development across multiple programming languages and technology stacks.

🎯

Goal

Create a system to archive, intelligently categorize, and efficiently browse personally curated social media content.

⏱️

Timeline

3 months (July - September 2023), with iterative improvements throughout the process.

🧠

Role

Full-stack developer handling all aspects: architecture, data extraction, machine learning implementation, and UI development.

🛠️

Tools & Technologies

PowerShell, Selenium WebDriver, C#/.NET, TensorFlow, Python, Tkinter, OpenCV, JDownloader

Challenge & Solution

The Challenge

The project presented several significant technical and architectural challenges that required innovative solutions across different technology stacks:

  • Initially navigating Twitter's complex, dynamic HTML structure to scrape media without an API (which would have been prohibitively expensive)
  • Expanding to multiple platforms and websites to build a diverse media collection
  • Processing, categorizing, and organizing thousands of media files from various sources efficiently to create a usable personal archive
  • Developing a lightweight yet powerful media browser that could handle various file formats including images, GIFs, and videos
  • Implementing machine learning for content classification without reinventing complex neural network architectures
  • Ensuring the system could run across different operating systems (macOS for initial development, Windows for GPU-accelerated classification)

The Solution

I created a multi-stage system that addressed these challenges through specialized components:

  • Data Extraction: Developed a PowerShell script with Selenium WebDriver that leveraged the "OldTwitter" browser extension to navigate Twitter's simplified HTML structure and extract media efficiently
  • Content Acquisition: Initially used custom Twitter-focused scraping scripts, later transitioned to JDownloader for bulk downloading from a wide variety of platforms and websites
  • Intelligent Categorization: Built a C# application using a .NET library (built on TensorFlow) for automated content classification with GPU acceleration
  • Media Browser: Created a custom Python/Tkinter application with integrated video/GIF playback, responsive controls, and preloading for a smooth browsing experience
  • Cross-Platform Workflow: Designed a process that utilized the strengths of each operating system—macOS for initial development and Windows for GPU-accelerated machine learning

Process & Methodology

The development process unfolded in distinct phases, each requiring different technical approaches and problem-solving strategies. I employed an iterative methodology, testing and refining each component before moving to the next stage.

1

Data Extraction & Scraping Development

The initial phase focused on solving the fundamental challenge of extracting media from Twitter as my first target platform. I explored several approaches before landing on a PowerShell + Selenium solution. A critical breakthrough came when I discovered the "OldTwitter" browser extension, which significantly simplified the HTML structure and made scraping more reliable.

The script evolution involved multiple iterations to handle various edge cases:

  • Processing JSON data from Twitter's "liked" posts
  • Navigating to each post and identifying media elements
  • Selecting high-quality versions of images and videos
  • Creating a resumable download process that could continue after interruptions
  • Intelligent file naming based on tweet content and sequential numbering
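The original extraction script was written in PowerShell; as a minimal Python sketch (with illustrative field and function names, not the actual script), the quality-selection and naming logic from the last two steps might look like this:

```python
import re

def pick_best_video(variants):
    """Pick the highest-bitrate MP4 from a list of video variants.

    `variants` is a list of dicts like {"url": ..., "bitrate": ...};
    the field names here are illustrative stand-ins for whatever the
    page's video source tags expose.
    """
    mp4s = [v for v in variants if v["url"].endswith(".mp4")]
    if not mp4s:
        return None
    return max(mp4s, key=lambda v: v.get("bitrate", 0))["url"]

def make_filename(tweet_text, index, ext):
    """Build a filesystem-safe, sequentially numbered filename from tweet content."""
    stub = re.sub(r"[^\w\- ]", "", tweet_text)[:40].strip().replace(" ", "_")
    return f"{index:04d}_{stub or 'media'}{ext}"
```

The same two decisions — prefer the highest-quality source, derive a stable name from content plus a sequence number — carried over unchanged when the pipeline later moved to JDownloader.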
2

Machine Learning Implementation

After successfully extracting media, I needed a way to intelligently categorize it. I evaluated several machine learning approaches and ultimately chose to build a C# application leveraging a content categorization .NET library, which provided a pre-trained model specifically designed for content classification.

This phase involved significant technical challenges:

  • Setting up the appropriate development environment with .NET dependencies
  • Configuring GPU acceleration with CUDA 10.1 and cuDNN 7.6.x to significantly speed up processing
  • Creating a robust file processing system that could handle thousands of images and videos
  • Implementing graceful error handling to skip problematic files without crashing
  • Structuring output directories to maintain organization of the classified content
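The classifier itself was a C# application, but the surrounding file-processing loop — graceful skipping, progress reporting, and per-category output directories — can be sketched in Python. `classify` below is a placeholder callable standing in for the .NET classifier:

```python
import shutil
from pathlib import Path

def route_classified_files(files, classify, out_root):
    """Move each file into a subdirectory named after its predicted
    category, skipping files the classifier cannot handle so one bad
    file never crashes a long run.
    """
    out_root = Path(out_root)
    done, skipped = 0, []
    for i, f in enumerate(files, 1):
        try:
            label = classify(f)
        except Exception:
            skipped.append(f)          # graceful skip, keep going
            continue
        dest = out_root / label
        dest.mkdir(parents=True, exist_ok=True)
        shutil.move(str(f), str(dest / Path(f).name))
        done += 1
        print(f"{100 * i // len(files)}% complete")
    return done, skipped
```

Returning the skipped list (rather than silently dropping failures) made it easy to re-run problem files later.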
3

Media Browser Development

The final phase focused on creating a user-friendly way to browse and interact with the categorized media collection. I developed a Python application using Tkinter for the GUI and integrated multiple media-handling libraries for a seamless viewing experience.

The viewer evolved from a simple file opener to a sophisticated media browser with features like:

  • Integrated display for images, GIFs, and videos without requiring external applications
  • Intuitive navigation controls for moving through the media collection
  • Media preloading for smoother transitions between items
  • Video playback controls with timeline scrubbing
  • GIF speed adjustment capabilities
  • Browsing history to track previously viewed items
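The preloading idea is independent of Tkinter and can be sketched on its own. In this simplified version (`load` is a stand-in for whatever decodes an image, GIF, or video frame, e.g. via PIL or OpenCV), requesting item *i* also schedules the next few items on a background thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

class Preloader:
    """Load media items ahead of the one currently shown, so stepping
    to the next item doesn't block the UI thread."""

    def __init__(self, paths, load, lookahead=2):
        self.paths, self.load, self.lookahead = paths, load, lookahead
        self.pool = ThreadPoolExecutor(max_workers=2)
        self.cache = {}                     # index -> Future

    def get(self, i):
        # Kick off loads for the next few items, then wait for item i.
        for j in range(i, min(i + self.lookahead + 1, len(self.paths))):
            if j not in self.cache:
                self.cache[j] = self.pool.submit(self.load, self.paths[j])
        return self.cache[i].result()
```

In the real browser, the cache is also pruned as the user moves on, so memory stays bounded even for large collections.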
Screenshot of the custom media browser UI built with Python and Tkinter

4

System Integration & Refinement

With the core components developed, I focused on integrating them into a cohesive workflow and refining the entire system. This included:

  • Transitioning from Twitter-specific custom scraping scripts to JDownloader for more efficient bulk downloading from multiple platforms and websites
  • Optimizing the machine learning component for faster processing by leveraging GPU acceleration
  • Fine-tuning the media browser for better performance with large collections
  • Creating clear workflow documentation for the entire process
  • Testing the system end-to-end with real-world data to ensure reliability

This phase was crucial for ensuring the components worked together seamlessly and could handle the volume of media being processed.

Technical Implementation

The technical implementation of this project spanned multiple programming languages and frameworks, each chosen for its specific strengths in addressing different aspects of the overall solution.

Data Extraction Script (PowerShell + Selenium)

The data extraction script was built in PowerShell using Selenium WebDriver to automate browser interactions. Key technical features included:

  • JSON parsing of Twitter's "like.json" file to extract tweet IDs and metadata
  • Browser automation with ChromeDriver to navigate to each tweet URL
  • Dynamic waiting for page elements using WebDriverWait to handle variable loading times
  • CSS selector targeting optimized for the "OldTwitter" extension's simplified HTML structure
  • Intelligent filtering to exclude unwanted elements like avatars and UI components
  • Quality selection logic for video sources to ensure highest resolution downloads
  • Resumable download tracking using a persistent log file
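The resumable tracking was implemented in PowerShell; the underlying pattern is simple enough to show as a Python sketch (`fetch` is a placeholder for the actual download step). Each successful download is appended to a persistent log, and a new run skips anything already logged:

```python
from pathlib import Path

def load_completed(log_path):
    """Read IDs of already-downloaded items from the persistent log."""
    p = Path(log_path)
    return set(p.read_text().split()) if p.exists() else set()

def download_all(ids, fetch, log_path):
    """Download each item not yet in the log, appending to the log
    after every success so an interrupted run resumes where it stopped.
    """
    done = load_completed(log_path)
    with open(log_path, "a") as log:
        for item_id in ids:
            if item_id in done:
                continue                      # already archived
            fetch(item_id)
            log.write(item_id + "\n")
            log.flush()                       # survive a crash mid-run
```

Flushing after every item is what makes a hard interruption (network drop, closed laptop) cost at most one re-download.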

The script was designed to be resilient to network issues and could be paused and resumed without losing progress, making it suitable for processing large collections over extended periods. While these scripts were Twitter-specific, they were an important starting point before I discovered JDownloader's capabilities for multi-platform collection.

Content Classification (C# + .NET)

The content classification component was implemented in C# using a content categorization .NET library, which leverages TensorFlow for machine learning capabilities. Technical highlights included:

  • GPU acceleration configuration with CUDA 10.1 and cuDNN 7.6.x for optimal performance
  • Specialized handling for different media types (images vs. videos)
  • Asynchronous processing for images to improve throughput
  • Robust error handling with try-catch blocks to skip problematic files
  • Progress tracking with percentage completion reporting
  • Automatic directory creation and file organization based on classification results

This component demonstrated significant performance improvements when running on a Windows system with an NVIDIA GTX 1080 Ti GPU compared to CPU-only processing, reducing classification time from hours to minutes for large collections.

Media Browser (Python + Tkinter)

The custom media browser was developed in Python using Tkinter for the GUI, with PIL/Pillow and OpenCV for media processing. Key technical implementations included:

  • Unified interface for viewing images, GIFs, and videos within a single application
  • Custom video player implementation with timeline scrubbing and playback controls
  • GIF handling with frame-by-frame control and speed adjustment
  • Media preloading system to reduce waiting time between items
  • Navigation history tracking to allow backward/forward movement through viewed items
  • Keyboard shortcuts for efficient browsing without requiring mouse interaction
  • Direct integration with the system's file manager for alternative viewing options
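The navigation history works like a web browser's: going back pushes the current item onto a forward stack, and visiting something new discards that forward branch. A minimal sketch of the data structure (names are illustrative, not the actual class):

```python
class History:
    """Track viewed items with browser-style back/forward navigation."""

    def __init__(self):
        self.back, self.forward = [], []
        self.current = None

    def visit(self, item):
        if self.current is not None:
            self.back.append(self.current)
        self.current = item
        self.forward.clear()            # new visit invalidates forward branch

    def go_back(self):
        if self.back:
            self.forward.append(self.current)
            self.current = self.back.pop()
        return self.current

    def go_forward(self):
        if self.forward:
            self.back.append(self.current)
            self.current = self.forward.pop()
        return self.current
```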

The browser evolved through multiple versions, from a simple external file opener to a sophisticated media viewer with integrated playback capabilities, demonstrating the iterative development process.

Results & Impact

The completed Social Media Archive Project delivered significant practical benefits while also advancing my technical skills across multiple domains. The system successfully achieved its core objectives while providing valuable insights into automated data extraction, machine learning implementation, and user interface design.

5,000+
Media Files Preserved
Created a comprehensive personal archive of media content from various social platforms and websites that would otherwise be vulnerable to platform changes or removal.
98%
Classification Accuracy
The machine learning component achieved high accuracy in content categorization, creating effectively organized collections.
30×
Faster Processing with GPU
GPU acceleration reduced content classification time from hours to minutes compared to CPU-only processing.

Qualitative Outcomes

Beyond the quantitative metrics, this project achieved several important qualitative benefits:

  • Reliable Media Preservation: Successfully created a stable archive of personally valuable content from multiple platforms and websites that is now protected from the risks of content removal or account suspension.
  • Enhanced Browsing Experience: The custom media browser provided a significantly more efficient and enjoyable way to navigate the media collection compared to standard file explorers.
  • Practical Machine Learning Application: Demonstrated the practical value of machine learning for content organization by implementing an automated classification system that saved countless hours of manual sorting.
  • Cross-Platform Integration: Successfully bridged technologies across macOS and Windows environments, leveraging the strengths of each platform for different aspects of the solution.

"This project exemplifies how technical skills can be applied to create practical solutions for personal needs. By combining web automation, machine learning, and custom interface development, I created a system that not only preserves valuable content but makes it more accessible and organized than the original source platform."

— Personal reflection on the project's impact

Reflection & Learnings

This project provided valuable lessons across multiple domains, from web automation and machine learning to user interface design and cross-platform development. The challenges encountered and solutions developed have significantly enhanced my technical capabilities and problem-solving approach.

What Worked Well

  • Leveraging the "OldTwitter" extension dramatically simplified HTML structure, making web scraping more reliable and maintainable
  • Using pre-trained machine learning models through a .NET library provided sophisticated classification capabilities without requiring extensive ML expertise
  • Implementing a modular architecture with specialized components for each phase allowed for targeted optimization and refinement
  • Transitioning to JDownloader for bulk downloading improved efficiency once the initial custom scripts validated the approach
  • Creating a custom media browser with integrated playback significantly enhanced the user experience compared to standard file explorers

Challenges & Solutions

  • Twitter's complex dynamic HTML initially posed significant scraping challenges, which were overcome by discovering and utilizing the "OldTwitter" extension
  • GPU acceleration setup required multiple attempts with different CUDA/cuDNN versions before finding a compatible configuration (CUDA 10.1 / cuDNN 7.6.x)
  • Cross-platform development issues were solved by creating a workflow that used each operating system for its strengths—macOS for initial development and Windows for GPU-accelerated machine learning
  • Media format diversity complicated the browser development, addressed by implementing specialized handlers for different file types with consistent UI
  • Performance issues with large collections were resolved through implementation of preloading and optimized rendering techniques

Future Considerations

  • Further refining the multi-platform approach to target specific content types from each source
  • Implementing more sophisticated classification categories beyond the current binary approach
  • Adding natural language processing to analyze and categorize text content from posts
  • Creating a unified interface that integrates all components into a single application
  • Exploring cloud-based processing options for improved scalability and performance
  • Implementing facial recognition to organize media by individuals appearing in the content

Personal Takeaway

This project reinforced my belief in the value of applying technical skills to solve personal challenges. What began as a simple desire to preserve content evolved into a comprehensive exploration of multiple technologies and approaches. The most satisfying aspect was seeing how different domains—web automation, machine learning, and UI development—could be integrated to create something greater than the sum of its parts. This experience has not only enhanced my technical toolkit but also strengthened my confidence in tackling complex, multi-faceted problems.