Hey All!
It was great seeing everyone at Gulf Con a couple of weeks ago! I am copying the notes from all of our Data Management Sessions below. Please reply to this thread, or feel free to add details here, if there was something that I missed. Thanks!
Presentation 1 - GRIIDC: How an established repository evolves with changing technology and standards
- GRIIDC is a multidisciplinary data repository
- Makes data publicly available through the Harte Research Institute at TAMUCC
- The team ensures everything goes through QA/QC
- The system charges a fee to store data; access is free
- If you are interested in using the system, the fee covers data storage and help with data management planning
- DOIs are issued for each dataset and the team helps distribute it
- Gulf wide organizations are included in the repository
- 3614 datasets
- Total of 172 TB
- 50,000+ downloads
- GoMRI was the first organization to require open data; now everyone requires that you share data or make it open
- FAIR guiding principles - Findable, Accessible, Interoperable, Reusable
- Searchable
- Metadata
- Google dataset search
- Standardize keywords
- Collects POC for each dataset
- Follows ISO 19115-2 metadata standards
- TRUST principles - Transparency, responsibility, user focus, sustainability, technology
- GRIIDC uses 2 virtual machines that host the data
- Datasets larger than 25 GB are stored on AWS
- Data backups are on all systems so no data is lost
- GRIIDC has devised their own standardized keywords
- The GRIIDC monitoring page has been updated
- There is a tracking status of data in the system that can be used by a data submitter or a user to see where the data is during the upload process
- A user can also download the report about the data from this page
- There is also a map search option so that a user can see what data is available in the system in a specific area
- Challenges
- Some of the metadata attributes
- Keywords - have to be backfilled as time goes on
- There are free text areas that need to be standardized
- Large datasets - curation takes time (curation = filling out the metadata information and making sure it's human readable)
- Storing data costs money
- Takes time to curate the data
- Compliance
- Researchers have to know the rules of inputting metadata and what the requirements are
- Data has to be QA/QCed
- Collaboration
- Data repositories
- Researchers
- Need buy-in from the researchers to achieve curation
- Funding agencies
- Need to create the timelines for funding data archiving into repositories
Presentation 2 Part 1 - A New Coastal Data Ecosystem: How Florida's Seafloor Mapping Initiative is Meeting Diverse User Needs
- Two phase data collection
- LiDAR to collect shallow data
- 20-40 m deep
- Multibeam for anything deeper
- Prioritization
- The team gave researchers a set of tokens to place along the coast to show where they wanted data collected
- Sites were then chosen based on which locations received the most tokens
- LiDAR - 75,595 km²
- LiDAR point cloud
- Bare Earth DEM
- GIS data
- Mapping reports
- Sonar - 64,382 km²
- Sonar point cloud
- Bathymetric attributes
- Reports
- Use case - an aircraft carrier was sunk off the coast of Florida
- Largest offshore artificial reef in the world
- A larger ship is currently being prepped for sinking and will become the largest artificial reef in the world
- Challenges
- Funds were received by 2024
- Funds need to be spent by 2026
- All LiDAR data has been collected
- This coastal mapping program is just part of the work that is being done by the FDEP team
- Technical issues
- Hurricanes/inclement weather
- Size of data to use/download - it is very large, takes a long time to download, and takes a lot of computing power to use
- Lots of moving pieces
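For anyone curious how the token-based prioritization from Part 1 could be tallied, here is a minimal sketch. It assumes each token is recorded as the name of the site it was placed on; the site names and the `rank_sites` helper are made up for illustration and are not the FDEP team's actual tooling.

```python
from collections import Counter

def rank_sites(token_placements, top_n):
    """Return the top_n sites by token count, most-requested first."""
    counts = Counter(token_placements)
    # Sort by count (descending), then by name so ties are deterministic
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

# Hypothetical placements: one entry per token a researcher placed
placements = ["Site A", "Site B", "Site A", "Site C", "Site B", "Site A"]
print(rank_sites(placements, top_n=2))  # [('Site A', 3), ('Site B', 2)]
```

A real tally would probably work from map coordinates rather than site names, so the tokens would first need to be binned into areas.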
Presentation 2 Part 2 - Topo-bathymetric integration - workflow improvements
- Working with USGS to stitch the models together
- Really strong metadata that will show details of how and when the data was collected
- The metadata was recorded when the data was collected, with some holes that will need to be filled during QA/QC
- There is a dashboard on the FDEP hub site that shows this work
- Data is going to be available next summer
- The project team wanted the inland bathymetry to be included in the maps
- Terrestrial and bathymetric data have been connected wherever terrestrial data is available
- Florida GIO.gov
- Initiatives - maps are available along with data timelines
- Jim's information is there as well
- This project has won several awards nationally and internationally
- Questions -
- Project off the coast of Destin - need to collaborate on where the break spot was (20-30 miles out, which is outside of the project area)
- Sediment identification - are vendors coming in to do that, or what are the next steps?
- FSU coming in to do some backscatter data and would be doing some of that work
- NOAA has done a lot of the Panhandle data
Presentation 3 - Fine-Tuned Large Language Models for Natech Analytics
- Environmental hazards - air pollution, water insecurity
- Corpus Christi recently experienced a water shortage
- This is important as there was a lack of quality, accessibility, and affordability of water
- Natechs - technological accidents triggered by natural hazards (like natural disasters, but the damage comes from technology centers)
- Unpermitted release of pollutants into neighborhoods
- Need to capture these unpredicted events that are not captured in traditional models
- There are no current models that look at Natechs
- The project team wanted to see if there was a correlation between natechs and natural disasters in countries
- In other words, do countries with more natural disasters tend to have more natechs?
- Looking at climate effects on localized environmental health disparities in overburdened communities
- LLM
- There are a lot of documents and reports that are available regarding the impacts of natural disasters and when Natechs happen
- A pretrained model can be used to look through this data
- First had to work out which prompts to use for the LLM to get outputs that would answer the questions the team was trying to answer
- To fine-tune the LLM, the project team used Meta AI
- Fine-tuning brought the LLM's accuracy to 0.958, with a precision of 0.957
- The pretrained model had too much data, which hurt its precision and accuracy
- What are the hazards that are being triggered
- Natechs cause 6% of emission incidents
- Lightning strikes and freezing events could also trigger natechs
- Natech impacts seem to follow strong seasonal trends
- Meaning more preparedness is needed during certain times of the year
- Gulf Wide vs. Just Texas
- Hurricanes at the Gulf Wide view were a major trigger of natech events
- The Gulf Coast is facing higher excessive emissions (14%) than Texas alone
- Takeaways
- Fine-tuned LLM - turns air emission narratives into structured natech analytics
- Two-decade analysis of Texas natechs
- AI-assisted Natech research and management
- Model the issues that come from climate disasters
- How can we measure future air pollution/water insecurity?
- What are the health impacts of these things?
- Solutions need to be designed and implemented
- Need data input from our local partners
- Brought in Texas A&M University to help with the data collection
- This prototype is the starting point and can be applied more broadly
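As a reference for the accuracy and precision figures noted in Presentation 3, here is a minimal sketch of how those two metrics are usually computed for a yes/no classifier (for example, "is this emission report a natech?"). The counts below are invented for illustration and are not the project's results; how the team actually scored the model was not covered in the notes.

```python
def accuracy(tp, fp, tn, fn):
    """Share of all predictions that were correct."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0

def precision(tp, fp):
    """Share of predicted positives that were truly positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0

# Hypothetical confusion-matrix counts (not from the presentation)
tp, fp, tn, fn = 90, 10, 880, 20
print(f"accuracy:  {accuracy(tp, fp, tn, fn):.3f}")   # accuracy:  0.970
print(f"precision: {precision(tp, fp):.3f}")          # precision: 0.900
```

Accuracy can look high when natechs are rare among all reports, which is why precision is worth reporting alongside it.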