2024-09-19 Meeting notes

 Date

Sep 19, 2024

 Participants

  • @Tyler Hill

  • @Josh

 Discussion topics

Item

Notes

Overview

  • Josh will be adding more intro issues to the GitHub soon

  • Will be going through chromedp instead of continuing A Tour of Go

chromedp

In api-tools, main.go runs all the scrapers. We’ll be looking at scrapeCoursebook.

  • initChromeDp runs first; the allocator sets up the browser

  • RunResponse runs a set of actions on the context

    • Run does not wait for page navigation triggered by an action, like clicking a button; RunResponse waits for the page to load and returns the main response

  • use RunResponse to log in with your netID and password to scrape without being rate limited

    • ClearBrowserCookies (wrapped in an ActionFunc, which acts like a custom action) to get a new token at the start of each new prefix, which resets your rate limit

    • get the headers, including the cookies, to use in later requests

  • make a new request to Coursebook’s behind-the-scenes URL for each course (reverse engineering this makes scraping faster and easier, since you don’t have to click through so many dropdowns/buttons)

    • and then for each section in each course

    • also refresh the token every 30 seconds, just in case

  • Currently coursebook scraping is not working because the scraper isn’t able to log in
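The request-and-refresh steps above can be sketched with only the standard library. This is a minimal sketch, not the api-tools code: the `session` type, its `login` callback (a stand-in for the browser login that yields the cookie headers), the fake clock, and the `/course/N` paths are all assumptions made for illustration, and a local test server replaces Coursebook.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// session keeps the headers captured after login and when they were captured.
// login is a hypothetical stand-in for the browser login step.
type session struct {
	headers   http.Header
	fetchedAt time.Time
	maxAge    time.Duration
	login     func() (http.Header, error)
	now       func() time.Time
}

// ensureFresh logs in again when there are no headers yet
// or when they are at least maxAge old.
func (s *session) ensureFresh() error {
	if s.headers != nil && s.now().Sub(s.fetchedAt) < s.maxAge {
		return nil
	}
	h, err := s.login()
	if err != nil {
		return err
	}
	s.headers, s.fetchedAt = h, s.now()
	return nil
}

// get sends one direct request that reuses the login headers.
func (s *session) get(client *http.Client, url string) (string, error) {
	if err := s.ensureFresh(); err != nil {
		return "", err
	}
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return "", err
	}
	req.Header = s.headers.Clone()
	resp, err := client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// runDemo makes three requests against a local test server at simulated
// times 0s, 10s, and 31s, and returns the bodies and the number of logins.
func runDemo() ([]string, int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Echo the path and cookie so the caller can see what was sent.
		fmt.Fprintf(w, "%s cookie=%s", r.URL.Path, r.Header.Get("Cookie"))
	}))
	defer srv.Close()

	start := time.Unix(0, 0)
	clock := start
	logins := 0
	s := &session{
		maxAge: 30 * time.Second,
		now:    func() time.Time { return clock },
		login: func() (http.Header, error) {
			logins++
			h := http.Header{}
			h.Set("Cookie", fmt.Sprintf("token=%d", logins))
			return h, nil
		},
	}

	var bodies []string
	offsets := []time.Duration{0, 10 * time.Second, 31 * time.Second}
	for i, off := range offsets {
		clock = start.Add(off)
		body, err := s.get(srv.Client(), fmt.Sprintf("%s/course/%d", srv.URL, i))
		if err != nil {
			return nil, logins, err
		}
		bodies = append(bodies, body)
	}
	return bodies, logins, nil
}

func main() {
	bodies, logins, err := runDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, b := range bodies {
		fmt.Println(b)
	}
	fmt.Println("logins:", logins)
}
```

The clock is injected so the 30-second rule can be exercised without sleeping; in a real scraper `now` would simply be `time.Now`.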

Questions

  • How often do we scrape?

    • 3-4 times per semester for the latest semester as UTD makes changes occasionally

      • we could monitor a page that lists changes and scrape when it updates

  • Why do we have to scrape at all?

    • There’s supposed to be a way to ask the school for this data, but it’s full of red tape and didn’t work out when leadership tried 2 years ago

  • What happens when it fails?

    • Right now it just does nothing when the login button is pressed
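The idea raised above of monitoring a page that lists changes could look roughly like this. It is only a sketch under assumptions: the notes do not say which page would be watched or how, so this version hashes the page body and compares it with the previous hash, and it runs against a local test server rather than a real UTD page. `pageHash`, `changed`, and `runDemo` are hypothetical names.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// pageHash downloads url and returns the SHA-256 hash of its body.
func pageHash(client *http.Client, url string) ([32]byte, error) {
	var zero [32]byte
	resp, err := client.Get(url)
	if err != nil {
		return zero, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return zero, fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return zero, err
	}
	return sha256.Sum256(body), nil
}

// changed reports whether the page differs from the last seen hash
// and returns the new hash to remember for the next check.
func changed(client *http.Client, url string, last [32]byte) (bool, [32]byte, error) {
	h, err := pageHash(client, url)
	if err != nil {
		return false, last, err
	}
	return h != last, h, nil
}

// runDemo checks a local test page three times; the content is changed
// just before the third check.
func runDemo() ([]bool, error) {
	var version atomic.Int32
	version.Store(1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "change log v%d", version.Load())
	}))
	defer srv.Close()

	var last [32]byte // zero value: nothing seen yet
	var results []bool
	for i := 0; i < 3; i++ {
		if i == 2 {
			version.Store(2)
		}
		diff, h, err := changed(srv.Client(), srv.URL, last)
		if err != nil {
			return nil, err
		}
		results = append(results, diff)
		last = h
	}
	return results, nil
}

func main() {
	results, err := runDemo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(results)
}
```

Because the starting hash is the zero value, the first check always reports a change; a real monitor would likely persist the last hash between runs and trigger a scrape only on later differences.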

chromedp docs overview

  • lots of types

  • Actions don’t have much in common, just the Do function

  • since we’re calling functions and passing their results into RunResponse you can use anything that returns an Action

  • query Actions: find things in the page that match a certain pattern

    • Query Options

      • search below a specific node in the DOM tree, for large pages

    • By Options

      • By__ options choose how the selector is interpreted, e.g. as an XPath query or like a document.querySelector__() call in JS

    • Node Options

      • check properties of an element: visible, disabled, …
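The point that Actions share only a Do function, and that anything returning an Action can be passed along, can be shown without a browser. The sketch below does not import chromedp: `Action` and `ActionFunc` are local stand-ins that mirror the shape of the chromedp types of the same name, while `run`, `logStep`, and `runDemo` are hypothetical helpers invented for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Action is a local stand-in with the same shape as chromedp's Action:
// a single Do method.
type Action interface {
	Do(ctx context.Context) error
}

// ActionFunc adapts a plain function into an Action,
// the same way chromedp.ActionFunc does.
type ActionFunc func(ctx context.Context) error

// Do calls the wrapped function.
func (f ActionFunc) Do(ctx context.Context) error { return f(ctx) }

// logStep is a hypothetical helper. Because it returns an Action,
// its result can be passed straight to run.
func logStep(trace *[]string, name string) Action {
	return ActionFunc(func(ctx context.Context) error {
		*trace = append(*trace, name)
		return nil
	})
}

// run is a simplified stand-in for a runner: it executes the actions
// in order and stops at the first error.
func run(ctx context.Context, actions ...Action) error {
	for _, a := range actions {
		if err := ctx.Err(); err != nil {
			return err
		}
		if err := a.Do(ctx); err != nil {
			return err
		}
	}
	return nil
}

// runDemo runs four actions; the third fails, so the fourth never runs.
func runDemo() ([]string, error) {
	var trace []string
	err := run(context.Background(),
		logStep(&trace, "clear cookies"),
		logStep(&trace, "navigate"),
		ActionFunc(func(ctx context.Context) error {
			return errors.New("login failed")
		}),
		logStep(&trace, "never reached"),
	)
	return trace, err
}

func main() {
	trace, err := runDemo()
	fmt.Println(trace, err)
}
```

Since the runner only needs Do, custom steps written as ActionFunc mix freely with library-provided actions in one argument list.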

Questions

  • Data pipeline

    • Scraper, parser, uploader
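As a rough picture of the three stages, here is a toy pipeline. Everything in it is assumed for illustration: the `Section` type, the sample strings, and the function names are made up, and the uploader only appends to a slice. The notes do not describe the real stages' interfaces or data format.

```go
package main

import (
	"fmt"
	"strings"
)

// Section is a hypothetical parsed record; the real schema is not
// described in these notes.
type Section struct {
	Course string
	Number string
}

// scrape stands in for the scraper stage and returns made-up raw text.
func scrape() []string {
	return []string{"CS 1337.001", "CS 1337.002", "malformed"}
}

// parse converts raw text such as "CS 1337.001" into a Section.
func parse(raw string) (Section, error) {
	course, number, ok := strings.Cut(raw, ".")
	if !ok {
		return Section{}, fmt.Errorf("cannot parse %q", raw)
	}
	return Section{Course: course, Number: number}, nil
}

// upload stands in for the uploader stage; here it only collects
// records into a slice instead of writing to a database.
func upload(store *[]Section, s Section) {
	*store = append(*store, s)
}

// runPipeline feeds each scraped item through parse and upload,
// counting items that fail to parse.
func runPipeline() (stored []Section, skipped int) {
	for _, raw := range scrape() {
		s, err := parse(raw)
		if err != nil {
			skipped++
			continue
		}
		upload(&stored, s)
	}
	return stored, skipped
}

func main() {
	stored, skipped := runPipeline()
	fmt.Println(stored, "skipped:", skipped)
}
```

Keeping the stages as separate functions means each one can be rerun or tested on its own, for example reparsing saved scraper output without scraping again.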

Next meeting

  • Will be all about MongoDB

 Decisions

 Action items

@Josh add new intro issues to GitHub Sep 26, 2024
