Parsing HTML with Python
My latest Python project was parsing a large HTML file to extract specific information – in this case, my Facebook history, which I downloaded when I was once again thinking about dumping the account. I’ve been on the site since 2009 and, while I’ve always been careful about what information I put out there, I wanted to get a real sense of what my posting history revealed to the algorithm.
The full download came to almost 1.4 GB; I’ve been pretty busy over the years sharing things and enjoying the conversation. The actual download is broken down into a collection of HTML web pages that you can browse through to see the information that you want. There’s a main index file called “start_here.html” that works as a menu.
The actual content is distributed across different pages in a number of directories. I was mostly interested in the text of the posts and comments that I’d made over the years. The archive includes images like memes and photos but a wholesale examination of those would take a bit more work than I wanted to get into and I tend to put a lot into words rather than images. Finally, the data I needed was spread across four files.
\comments_and_reactions\comments.html
\posts\your_posts__check_ins__photos_and_videos_1.html
\groups\group_posts_and_comments.html
\groups\your_comments_in_groups.html
Altogether, these files in my archive came to over 15 MB. I wanted to extract just the text of my own posts and comments and the dates on which they’d been made into smaller text files. I didn’t include private messages.
I probably could have gotten ChatGPT to handle this for me or at least write the code to do it but there’s no real fun in that and I wanted to tackle this myself. I went back to the Udemy course The Complete Python Bootcamp by Jose Portilla, dug into the section on web scraping and learned how to use the BeautifulSoup HTML parser to grab the information I wanted.
BeautifulSoup can target specific text in a file, or it can target by CSS classes, IDs and tags. After looking at the files for a bit, I saw that all content is enclosed in a large hierarchy of HTML divs, with a class called “_2pin” marking the post and comment text and another called “_a72d” marking the date. I thought a simple output file showing the date and then the comment in chronological order would work fine.
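As a small illustration of how class-based targeting works, BeautifulSoup’s find_all accepts a list of classes and matches any of them. The HTML fragment below is a made-up stand-in for the archive’s structure; the class names are the ones I found in my export, and yours may differ.

```python
import bs4

# A minimal fragment mimicking the archive's layout (hypothetical content;
# the "_a72d" and "_2pin" class names come from my own export).
html = """
<div class="_a72d">January 5, 2021</div>
<div class="_2pin">This is the text of a post.</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
# Passing a list to class_ matches divs carrying any of the listed classes,
# returned in document order -- which gives the date-then-text sequence.
for div in soup.find_all("div", class_=["_a72d", "_2pin"]):
    print(div.get_text())
```

Because find_all returns elements in document order, the date div naturally prints just before its post text.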
You can find the full script I used on GitHub but the main portion isn’t that complicated.
import bs4
import sys

# Move through the file looking for matches.
# Print to screen and output to file if specified.
with open(sys.argv[1], "r", encoding="utf-8") as f:
    htmlsource = f.read()

pagecontent = bs4.BeautifulSoup(htmlsource, "html.parser")
for postDiv in pagecontent.find_all("div", class_=searchValues):
    for innerDiv in postDiv:
        if len(innerDiv.getText()) > 0:
            print(innerDiv.getText())
            if outputFile is not None:
                out.write(innerDiv.getText() + "\n")
The main script accepts the input file as a command line argument along with an optional output file. The search values are contained in a Python list object that can be changed at the top of the script. Despite the size of the files, it only took a few seconds to run for each of them and the output files were almost exactly what I wanted.
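The argument handling can be sketched as a small helper along these lines. This is my own illustration rather than the script’s actual code, and the function name parse_args is made up for the example; the real script reads sys.argv directly.

```python
def parse_args(argv):
    """Return (input_file, output_file) from a sys.argv-style list.

    The input file is required; the output file is optional and
    None means print to screen only.
    """
    if len(argv) < 2:
        # No input file given -- bail out with a usage message.
        raise SystemExit("Usage: parse.py input.html [output.txt]")
    input_file = argv[1]
    output_file = argv[2] if len(argv) > 2 else None
    return input_file, output_file
```

Called with something like parse_args(["parse.py", "comments.html", "comments.txt"]), it returns the pair of filenames, and the optional second value drives the “output to file if specified” branch in the loop above.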
That’s still a lot of information to look through, so that’s when I decided to call in AI. NotebookLM is a great tool for analyzing large amounts of information. You can create notebooks around specific subjects and generate study resources like flash cards, quizzes and mind maps and, of course, the audio and video overviews that became so popular. NotebookLM will also answer specific questions based on the collected materials.
Of course, this did mean putting the information into my Google account but I figure there’s probably nothing there they don’t have already in one way or another. If you choose to parse your own archive, please upload it with care, keep the location private and consider deleting it after you’ve finished analyzing it.
When I saw some of the information that could be inferred about me, I was a little surprised until I remembered some of the conversations with friends outside of my actual posts. Yeah, Facebook keeps those, too. Fortunately, there’s nothing damaging there, just stuff that might make me more of a target for various ads, something FB has certainly been doing more and more lately. Nevertheless, I will be cutting down my activity on Facebook even further. This analysis also didn’t include things like hitting the like button, page follows and group memberships. All of that is included in the archive and could be added with a longer script.