Preserving and intelligently organizing media content using data scraping and machine learning
Social media platforms constantly evolve, and unfortunately that often means content disappears—whether through policy changes, account suspensions, or platform restructuring. This project was born from the need to reliably preserve media content from multiple platforms that I had engaged with and wanted to reference later. It started with Twitter, where content takedowns are common, but quickly expanded to cover many other websites and social media sources.
The project evolved into a comprehensive media management system with three key components: an automated data scraping solution to extract and download media, a machine learning-powered classification system to intelligently categorize the downloaded content, and a custom media browser for efficient navigation of the resulting archive.
What began as a simple archiving tool transformed into an exploration of web automation, machine learning implementation, and graphical user interface development across multiple programming languages and technology stacks.
Objective: Create a system to archive, intelligently categorize, and efficiently browse personally curated social media content.
Timeline: 3 months (July–September 2023), with iterative improvements throughout.
Role: Full-stack developer handling all aspects: architecture, data extraction, machine learning implementation, and UI development.
Technologies: PowerShell, Selenium WebDriver, C#/.NET, TensorFlow, Python, Tkinter, OpenCV, JDownloader
The project presented several significant technical and architectural challenges spanning different technology stacks: extracting media reliably from platforms that actively change their markup, categorizing thousands of downloaded files without manual effort, and navigating the resulting archive efficiently. I addressed these with a multi-stage system of specialized components: a scraping pipeline for extraction, a machine learning classifier for organization, and a custom viewer for browsing.
The development process unfolded in distinct phases, each requiring different technical approaches and problem-solving strategies. I employed an iterative methodology, testing and refining each component before moving to the next stage.
The initial phase focused on solving the fundamental challenge of extracting media from Twitter as my first target platform. I explored several approaches before landing on a PowerShell + Selenium solution. A critical breakthrough came when I discovered the "OldTwitter" browser extension, which significantly simplified the HTML structure and made scraping more reliable.
The script went through multiple iterations to handle edge cases such as lazily loaded timeline content and interrupted sessions; the core loop is sketched below.
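To give a concrete sense of the approach, here is a minimal sketch of that loop. The original script was PowerShell; this Python equivalent uses the same Selenium primitives, and the URL, selector, and paths are illustrative assumptions rather than the script's actual values.

```python
# Minimal sketch of the Selenium scraping loop (Python equivalent of the
# original PowerShell script; URL, selector, and paths are hypothetical).
import time
import requests
from pathlib import Path
from selenium import webdriver
from selenium.webdriver.common.by import By

OUT_DIR = Path("archive/twitter")
OUT_DIR.mkdir(parents=True, exist_ok=True)

driver = webdriver.Chrome()
driver.get("https://twitter.com/your_likes_page")  # placeholder URL
seen = set()

for _ in range(50):  # scroll a fixed number of times
    # Collect media URLs currently present in the rendered DOM.
    for img in driver.find_elements(By.CSS_SELECTOR, "img"):
        src = img.get_attribute("src") or ""
        if "media" in src and src not in seen:
            seen.add(src)
            data = requests.get(src, timeout=30).content
            (OUT_DIR / src.split("/")[-1].split("?")[0]).write_bytes(data)
    # Scroll to trigger the next batch of lazily loaded tweets.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the page time to load

driver.quit()
```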
After successfully extracting media, I needed a way to intelligently categorize it. I evaluated several machine learning approaches and ultimately chose to build a C# application leveraging a content categorization .NET library, which provided a pre-trained model specifically designed for content classification.
This phase involved significant technical challenges, from preparing thousands of heterogeneous downloaded files for inference to keeping classification throughput acceptable on a large collection.
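The classifier itself was a C# application, but the underlying technique translates directly: load a pre-trained model once, score each file, and move it into a folder named after its top category. Below is a hedged Python/TensorFlow sketch of that pattern; the model (MobileNetV2/ImageNet) and folder layout are stand-ins, not what the project actually used.

```python
# Sketch of pre-trained-model classification (illustrative; the project's
# actual classifier was a C#/.NET library). Model and categories are
# stand-ins, not the ones the project used.
from pathlib import Path
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)

model = MobileNetV2(weights="imagenet")  # load pre-trained weights once

def classify(path: Path) -> str:
    """Return the model's top label for one image file."""
    img = tf.keras.utils.load_img(path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(tf.keras.utils.img_to_array(img), 0))
    preds = model.predict(x, verbose=0)
    return decode_predictions(preds, top=1)[0][0][1]  # human-readable label

archive = Path("archive")
for f in archive.glob("*.jpg"):
    label = classify(f)
    dest = archive / label
    dest.mkdir(exist_ok=True)
    f.rename(dest / f.name)  # move the file into its category folder
```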
The final phase focused on creating a user-friendly way to browse and interact with the categorized media collection. I developed a Python application using Tkinter for the GUI and integrated multiple media-handling libraries for a seamless viewing experience.
The viewer evolved from a simple file opener into a sophisticated media browser with features such as in-window image rendering via Pillow, integrated video playback via OpenCV, and fast navigation through the categorized archive.
[Screenshot: the custom media browser UI built with Python and Tkinter]
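At its core, a viewer like this is a Tkinter window that renders the current file with Pillow and steps through the collection on key presses. A stripped-down sketch of that skeleton follows; the folder path and key bindings are assumptions, and the real browser layered many more features on top.

```python
# Minimal Tkinter image browser sketch (illustrative; the real viewer had
# many more features). Folder path and key bindings are assumptions.
import tkinter as tk
from pathlib import Path
from PIL import Image, ImageTk

files = sorted(Path("archive").rglob("*.jpg"))
index = 0
photo = None

root = tk.Tk()
root.title("Media Browser")
label = tk.Label(root)
label.pack(expand=True, fill="both")

def show(i):
    """Load, downscale, and display the image at position i."""
    global index, photo
    index = i % len(files)
    img = Image.open(files[index])
    img.thumbnail((1200, 800))           # fit the window
    photo = ImageTk.PhotoImage(img)      # keep a reference so it isn't GC'd
    label.configure(image=photo)
    root.title(files[index].name)

root.bind("<Right>", lambda e: show(index + 1))
root.bind("<Left>", lambda e: show(index - 1))
show(0)
root.mainloop()
```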
With the core components developed, I focused on integrating them into a cohesive workflow and refining the entire system. This included collecting media from many platforms through JDownloader, feeding each new batch through the classifier, and surfacing the sorted results in the browser; the glue logic is sketched after the next paragraph.
This phase was crucial for ensuring the components worked together seamlessly and could handle the volume of media being processed.
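As a rough picture of that glue (all paths and executable names below are hypothetical), the workflow amounted to staging finished downloads, running the classifier over them, and opening the browser on the sorted result:

```python
# Sketch of the integration glue (paths and executable names are
# hypothetical): stage a fresh JDownloader batch, classify it, browse it.
import shutil
import subprocess
from pathlib import Path

DOWNLOADS = Path("jdownloader/out")   # where JDownloader drops files
INBOX = Path("archive/unsorted")      # staging area for the classifier
INBOX.mkdir(parents=True, exist_ok=True)

# Stage 1: move finished downloads into the classifier's inbox.
for f in DOWNLOADS.iterdir():
    if f.is_file():
        shutil.move(str(f), str(INBOX / f.name))

# Stage 2: run the C# classifier as an external tool over the inbox.
subprocess.run(["Classifier.exe", str(INBOX)], check=True)

# Stage 3: launch the Python media browser on the sorted archive.
subprocess.run(["python", "browser.py", "archive"], check=True)
```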
The technical implementation of this project spanned multiple programming languages and frameworks, each chosen for its specific strengths in addressing different aspects of the overall solution.
The data extraction script was built in PowerShell using Selenium WebDriver to automate browser interactions. Key technical features included automated timeline scrolling to trigger lazy loading, collection of media URLs from the rendered DOM, and download handling that tolerated failures.
The script was designed to be resilient to network issues and could be paused and resumed without losing progress, making it suitable for processing large collections over extended periods. While these scripts were Twitter-specific, they were an important starting point before I discovered JDownloader's capabilities for multi-platform collection.
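Pause-and-resume behavior of this kind usually comes down to persisting progress between runs. Here is a minimal sketch of the idea, assuming a simple append-only log of completed URLs (the original script's actual mechanism may have differed):

```python
# Sketch of resumable downloading via a persisted progress log (the log
# format and helper name are assumptions, not the original mechanism).
from pathlib import Path
import requests

LOG = Path("downloaded_urls.txt")
done = set(LOG.read_text().splitlines()) if LOG.exists() else set()

def download(url: str, out_dir: Path) -> None:
    """Download url unless a previous run already fetched it."""
    if url in done:
        return  # skip work finished before the interruption
    data = requests.get(url, timeout=30).content
    (out_dir / url.split("/")[-1].split("?")[0]).write_bytes(data)
    # Append to the log only after the file is safely on disk.
    with LOG.open("a") as f:
        f.write(url + "\n")
    done.add(url)
```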
The content classification component was implemented in C# using a content categorization .NET library, which leverages TensorFlow for machine learning capabilities. Technical highlights included inference with the library's pre-trained model, GPU acceleration, and automatic sorting of media into category folders based on the model's output.
Running on a Windows system with an NVIDIA GTX 1080 Ti, this component cut classification time for large collections from hours to minutes compared with CPU-only processing.
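A speedup from hours to minutes is what you would expect from moving inference onto a GPU and batching inputs instead of scoring files one at a time. A rough Python/TensorFlow illustration of the batching side (model and batch size are assumptions):

```python
# Sketch of batched inference, the usual way to exploit a GPU for bulk
# classification (model and batch size are illustrative assumptions).
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input)

print(tf.config.list_physical_devices("GPU"))  # confirm the GPU is visible

model = MobileNetV2(weights="imagenet")

def predict_batched(paths, batch_size=64):
    """Score images in batches instead of one predict() call per file."""
    for i in range(0, len(paths), batch_size):
        batch = np.stack([
            tf.keras.utils.img_to_array(
                tf.keras.utils.load_img(p, target_size=(224, 224)))
            for p in paths[i:i + batch_size]])
        yield model.predict(preprocess_input(batch), verbose=0)
```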
The custom media browser was developed in Python using Tkinter for the GUI, with PIL/Pillow and OpenCV for media processing. Key technical implementations included in-window image rendering, integrated video playback, and responsive navigation across the categorized archive.
The browser evolved through multiple versions, from a simple external file opener to a sophisticated media viewer with integrated playback capabilities, demonstrating the iterative development process.
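One common way to achieve integrated playback in a Tkinter app is to pump OpenCV frames into a label on a timer; the sketch below assumes that approach, though the viewer's exact implementation may have differed:

```python
# Sketch of OpenCV video playback inside Tkinter (one common way to get
# integrated playback; the project's exact approach isn't documented here).
import cv2
import tkinter as tk
from PIL import Image, ImageTk

root = tk.Tk()
label = tk.Label(root)
label.pack()

cap = cv2.VideoCapture("archive/example.mp4")  # placeholder path

def next_frame():
    ok, frame = cap.read()
    if not ok:
        cap.release()
        return
    # OpenCV delivers BGR; Tkinter/PIL expect RGB.
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    label.photo = ImageTk.PhotoImage(Image.fromarray(rgb))  # keep a reference
    label.configure(image=label.photo)
    root.after(33, next_frame)  # ~30 fps

next_frame()
root.mainloop()
```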
The completed Social Media Archive Project delivered significant practical benefits while advancing my technical skills across multiple domains. The system achieved its core objectives and yielded valuable insights into automated data extraction, machine learning implementation, and user interface design.
Beyond the quantitative metrics, the project achieved important qualitative benefits: content I valued is now preserved against takedowns, organized more usefully than on the source platforms, and quick to find again.
"This project exemplifies how technical skills can be applied to create practical solutions for personal needs. By combining web automation, machine learning, and custom interface development, I created a system that not only preserves valuable content but makes it more accessible and organized than the original source platform."
— Personal reflection on the project's impact
This project provided valuable lessons across multiple domains, from web automation and machine learning to user interface design and cross-platform development. The challenges encountered and solutions developed have significantly enhanced my technical capabilities and problem-solving approach.
This project reinforced my belief in the value of applying technical skills to solve personal challenges. What began as a simple desire to preserve content evolved into a comprehensive exploration of multiple technologies and approaches. The most satisfying aspect was seeing how different domains—web automation, machine learning, and UI development—could be integrated to create something greater than the sum of its parts. This experience has not only enhanced my technical toolkit but also strengthened my confidence in tackling complex, multi-faceted problems.