Automatically Determine The Number Of Columns In An HTML Table With Python's Beautiful Soup

This is what I use to find the number of columns in an HTML table.

Code

#!/usr/bin/env python3

from bs4 import BeautifulSoup


max_cols = 0

with open("source/1.html") as _in:
    soup = BeautifulSoup(_in.read(), 'html.parser')
    # Grab the first table with the CSS class "complex"
    table = soup.find("table", "complex")
    for row in table.find_all("tr"):
        # Count data (td) and header (th) cells together for this row
        cells = row.find_all(["td", "th"])
        max_cols = max(max_cols, len(cells))
    print(f"Max columns: {max_cols}")

Details

I'm working on an ASCII art tool. It includes a full set of Unicode characters to choose from. I pulled the characters from the W3C site. There are 28 pages with tables that sort everything from the top down, then column by column. Something like this:

a f k
b g l
c h m
d i n
e j o

They're set up that way based on their Unicode ID numbers. That makes sense on the W3C pages, but I want them sorted continuously from left to right.

a b c d e f g h
i j k l m n o
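
The reordering described above can be sketched in plain Python. This is only an illustration, not the Pandas conversion itself: the `reflow` function, the `grid` data, and the width of 8 are all made up for the example. It reads the grid column by column to recover the original order, then re-wraps it left to right.

```python
def reflow(rows, width):
    """Read a top-down, column-by-column grid in order, then wrap it left to right."""
    # The column count is the length of the longest row
    n_cols = max(len(row) for row in rows)
    # Walk each column from top to bottom, skipping cells missing from short rows
    ordered = [row[c] for c in range(n_cols) for row in rows if c < len(row)]
    # Slice the flat list into rows of the requested width
    return [ordered[i:i + width] for i in range(0, len(ordered), width)]


# Hypothetical sample matching the small a-o grid shown earlier
grid = [
    ["a", "f", "k"],
    ["b", "g", "l"],
    ["c", "h", "m"],
    ["d", "i", "n"],
    ["e", "j", "o"],
]

for line in reflow(grid, 8):
    print(" ".join(line))
```

Note that the first step needs the number of columns, which is what the snippet at the top of this post works out.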

I'm parsing the source HTML with Beautiful Soup, then doing the formatting conversion in Pandas. I want to know the number of columns in the tables so I can set up the Pandas data frame explicitly, so I put together the code snippet above to figure that out. It loops through every row of the table, counts the `th` (header) and `td` (data) cells on each row, and runs the totals through `max` to find the longest row.
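
For trying the idea out without a file on disk, here is a minimal sketch of the same counting logic wrapped in a function and run against an inline string. The `count_columns` name and the `SAMPLE_HTML` markup are hypothetical and only there for illustration. They aren't taken from the W3C pages.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with rows of 3, 2, and 4 cells
SAMPLE_HTML = """
<table class="complex">
  <tr><th>x</th><td>a</td><td>f</td></tr>
  <tr><th>y</th><td>b</td></tr>
  <tr><td>c</td><td>h</td><td>m</td><td>r</td></tr>
</table>
"""


def count_columns(html, css_class="complex"):
    """Return the largest number of th/td cells found on any row of the table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=css_class)
    if table is None:
        # No matching table in the markup
        return 0
    # Count header and data cells together, row by row, and keep the biggest count
    return max(
        (len(row.find_all(["td", "th"])) for row in table.find_all("tr")),
        default=0,
    )


print(f"Max columns: {count_columns(SAMPLE_HTML)}")
```

This counts cells, not grid width, so a cell with a `colspan` attribute still counts as one. That's fine for simple tables, but it would undercount on tables that merge cells.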