
Some sort of kanboard plugin to save the external links as a scraped site to provide a preview

01 Dec 2023 » code, infrastructure, docker

So this one was never finished off, but this is as far as I got with the Selenium scraper. The motivation was that I now have multiple dead external links whose content I wish I had saved.

Using Python Selenium and Docker

I chose Selenium with Python for the scraping because it drives a real browser and can handle pages that render through JavaScript, and Docker makes the Selenium server easy to stand up and tear down.

Docker Commands:

First, we need to run the Selenium Docker container:

docker run -d -p 4444:4444 -p 7900:7900 --shm-size="2g" selenium/standalone-firefox:4.16.1-20231219
  • Port 4444: This is used by the Selenium WebDriver.
  • Port 7900: This allows you to view what’s happening in real time through a web-based VNC client.
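Before pointing a script at the container, it's worth confirming the server is actually ready to accept sessions. A quick sanity check, sketched here with the requests library:

import requests  # pip install requests

# Selenium 4's standalone server exposes a status endpoint; "ready" flips
# to true once it can accept new sessions.
status = requests.get("http://localhost:4444/status").json()
print(status["value"]["ready"])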

Scraping the Data with Selenium:

Here’s a basic Python script (app.py) using Selenium to access a webpage and fetch its content:

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.page_load_strategy = 'normal'
# Accept self-signed or otherwise invalid certificates. (The Chromium-style
# --ignore-ssl-errors / --ignore-certificate-errors flags are no-ops for
# Firefox; this capability is the Firefox equivalent.)
options.accept_insecure_certs = True
# options.add_argument('--headless')  # Uncomment this line to run in headless mode

driver = webdriver.Remote(
    command_executor="http://localhost:4444/wd/hub",
    options=options
)

try:
    driver.get("https://jmoore53.com")
    page_source = driver.page_source
    print(page_source)
finally:
    # Always release the session; a leaked session can block the next run.
    driver.quit()

Setting Up WebDriver Options:

I configure the Firefox options to accept insecure certificates (via accept_insecure_certs), which is important for reaching a variety of sites without interruption. The script navigates to the desired URL and retrieves the page's HTML source. It is also important to always close the WebDriver session with driver.quit() to free up resources; I ran into an issue where, if the driver wasn't closed, the next session would lock. Wrapping the work in try/finally guarantees the quit runs even if the scrape throws.

Integrating This Functionality into Kanboard:

Database Interaction: Use Docker and MySQL to interact with Kanboard’s database and fetch the URLs from task_has_external_links.

docker exec -it kanboard-db-1 /bin/mysql -u root -psecret -e 'use kanboard; select url from task_has_external_links;'

Automating the Scrape and Save:

My goal going forward was to create a PHP plugin that triggers the Selenium script for each fetched URL, scraping the content and saving it to the database as a blob for easy retrieval and manipulation; the sketch below shows the shape of that loop.
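To make the flow concrete, here's a rough Python sketch of that loop, assuming the kanboard-db-1 container publishes MySQL on localhost and reusing the root/secret credentials from the docker exec example above. The scraped_content column is hypothetical (the real plugin would define its own storage), and the actual trigger would live in PHP:

import mysql.connector  # pip install mysql-connector-python
from selenium import webdriver

# Assumes the kanboard-db-1 container publishes MySQL on localhost:3306.
db = mysql.connector.connect(
    host="localhost", user="root", password="secret", database="kanboard"
)
cursor = db.cursor()
cursor.execute("SELECT id, url FROM task_has_external_links")
links = cursor.fetchall()

options = webdriver.FirefoxOptions()
options.accept_insecure_certs = True
driver = webdriver.Remote(
    command_executor="http://localhost:4444/wd/hub",
    options=options
)

try:
    for link_id, url in links:
        driver.get(url)
        # scraped_content is a hypothetical blob column the plugin would add.
        cursor.execute(
            "UPDATE task_has_external_links SET scraped_content = %s WHERE id = %s",
            (driver.page_source, link_id)
        )
        db.commit()
finally:
    driver.quit()
    db.close()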

Enhancing the Scraped Data:

To refine the scraped data, you might consider using BeautifulSoup, a Python library for parsing HTML and XML documents, to extract just the relevant body content from the saved HTML.
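As a sketch, a minimal pass that keeps just the readable body markup might look like this (page_source is the HTML captured by the Selenium script above):

from bs4 import BeautifulSoup  # pip install beautifulsoup4

soup = BeautifulSoup(page_source, "html.parser")
body = soup.body
# Drop script and style tags so only the readable content remains.
for tag in body.find_all(["script", "style"]):
    tag.decompose()
preview_html = str(body)                                  # trimmed markup to store
preview_text = body.get_text(separator="\n", strip=True)  # plain-text preview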
