This section describes research projects I've worked on that were intended to be published or otherwise presented in an academic format. They involved collaborating with professors, PhD students, and undergraduates to produce verifiable results that advance their fields.
| Title | Date | Type | Status |
|---|---|---|---|
| Hardware and Software Optimizations for Deep Learning Workloads on Graphics Processing Units | April 2023 to November 2024 | Thesis, main author | Accepted |
| Using Transfer Learning and Supercells to Improve Graph Neural Network Performance in Formation Energy Predictions | May 2024 to current | Paper, main author | In progress |
| Applying Transfer Learning to Defect Graph Neural Networks for Defect Formation Energy Predictions | May 2024 to July 2024 | Poster, graduate assistant | Submitted, waiting |
| Deputy NoC: A Case of Low Cost Network-on-Chip for Neural Network Accelerations on GPUs | August 2023 to May 2024 | Paper, coauthor | Submitted, waiting |
| Elastic-Float: Lossy Cache Compression for Cost Effective Neural Network Acceleration | January 2023 to December 2023 | Paper, coauthor | Rejected, revising |
| Genomics-GPU: A Benchmark Suite for GPU-accelerated Genome Analysis | August 2022 to November 2022 | Paper, coauthor | Accepted, ISPASS 2023 |
| GPU Implementation of Image Recognition Neural Network Architectures | May 2022 to July 2022 | Poster, main author | Presented, UNT 2022 REU |
Hardware and Software Optimizations for Deep Learning Workloads on Graphics Processing Units
Status: Accepted
Timeline: April 2023 to November 2024
Advisor: Dr. Hui Zhao
For the 2D convolution operation used frequently in image recognition, we can convert the operation into a matrix multiplication by duplicating the data in the input/weight matrices in a specific pattern. This duplication makes the cache hierarchy less effective: a cache miss on one copy of a value might have been avoided if a slightly different mapping function had placed it with another copy. We partition the L2 cache into "unique" and "normal" sections and use a new mapping function that, for addresses within the bounds of the lowered matrix, minimizes duplicated values in the unique section.
Specific contributions
- Made many different changes to the GPGPU-Sim simulator, including to the memory system, interconnect method, and timing simulator.
- Implemented motivational tests that show how much duplication was present at different timesteps of the program for different convolution layer sizes found throughout image recognition.
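The duplication described above comes from the standard "im2col" lowering. A minimal sketch, assuming a single-channel image and stride 1 (the real kernels handle multiple channels, strides, and padding):

```python
def im2col(image, k):
    """Lower a single-channel 2D image into the im2col matrix so that a
    k x k convolution becomes one matrix multiplication.  Each row is one
    flattened k x k patch; because patches overlap, the same pixel is
    duplicated across many rows -- the redundancy the project exploits."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(k) for dj in range(k)])
    return rows

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
patches = im2col(image, 2)
# Four overlapping 2x2 patches; the centre pixel 5 appears in all 4 rows.
```

Multiplying this matrix by the flattened filter weights yields the convolution outputs, at the cost of storing each interior pixel up to k*k times.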
Using Transfer Learning and Supercells to Improve Graph Neural Network Performance in Formation Energy Predictions
Status: In progress
Timeline: May 2024 to current
Advisor: Dr. Yuanxi Wang
Expands on the previous work by using supercells from density functional theory to represent the defect crystal input to a graph neural network. Each crystal is normally represented as a non-repeating "unit cell", with each atom as a node and each bond as an edge; a supercell instead contains repeated copies of the unit cell, connected through a pooling function that collates the nodes from the different repetitions. Transfer learning makes the much larger graph trainable, avoiding a problem other work faced when trying to increase model size on such a small dataset.
Specific contributions
- Expanded CGCNN to train a network on supercells containing duplicated unit cells.
- Used creative learning strategies (including genetic hyperparameter optimizers and circular learning rate methods that work especially well for transfer learning) to obtain a higher performance.
- Worked with a physics professor to show how changes to the deep-learning model are still physically logical.
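The pooling step that collates nodes from repeated unit cells can be sketched as a mean over corresponding atoms. This is an illustrative simplification with hypothetical names; the real model applies graph convolutions inside CGCNN before pooling:

```python
def pool_supercell(node_feats, n_unit):
    """Collate per-atom feature vectors from a supercell built of repeated
    unit cells.  node_feats is laid out as
    [rep0_atom0, rep0_atom1, ..., rep1_atom0, ...]; the result is one
    mean-pooled feature vector per unit-cell atom."""
    reps = len(node_feats) // n_unit
    pooled = []
    for atom in range(n_unit):
        copies = [node_feats[r * n_unit + atom] for r in range(reps)]
        pooled.append([sum(vals) / reps for vals in zip(*copies)])
    return pooled

# 2-atom unit cell repeated twice, 2-dimensional features.
feats = [[1.0, 2.0], [3.0, 4.0],   # repetition 0
         [3.0, 6.0], [5.0, 8.0]]   # repetition 1
print(pool_supercell(feats, 2))
```

Mean pooling keeps the output size fixed at one vector per unit-cell atom no matter how many repetitions the supercell contains, which is what lets the downstream layers stay unchanged.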
Applying Transfer Learning to Defect Graph Neural Networks for Defect Formation Energy Predictions
Status: Submitted, waiting (poster, report, GitHub)
Timeline: May 2024 to July 2024
Advisor: Dr. Yuanxi Wang
Showed how transfer learning can improve the performance of a graph neural network pretrained on pristine crystal structures and fine-tuned to predict the formation energies of defected crystal structures. The defect dataset is very small while the pristine dataset is much larger, and transfer learning yielded a modest performance improvement. Worked with two undergraduate students during UNT's Summer 2024 Artificial Intelligence REU.
Specific contributions
- Led the undergraduate students and organized meetings to ensure the success of the project, assigning specific tasks for each student to complete and working with the professor and another graduate student regularly.
- Modified the CGCNN and dGNN source code to implement transfer learning.
- Collected figures that demonstrate the effects of modifying different hyperparameters (like layer size, number of layers, training method, etc.) on the final model performance.
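The core of the transfer-learning modification is deciding which pretrained parameters stay frozen and which get re-trained on the small defect dataset. A minimal sketch with hypothetical parameter names (the actual split lives in the modified CGCNN/dGNN training code):

```python
def split_for_transfer(params, frozen_prefixes):
    """Partition pretrained parameters for fine-tuning: parameters whose
    names start with a frozen prefix keep their pretrained values and get
    no gradient updates; the rest are re-trained on the new dataset."""
    frozen = {k: v for k, v in params.items()
              if any(k.startswith(p) for p in frozen_prefixes)}
    trainable = {k: v for k, v in params.items() if k not in frozen}
    return frozen, trainable

# Hypothetical parameter names for a small graph network.
pretrained = {"embed.weight": 0.1, "conv1.weight": 0.2,
              "conv2.weight": 0.3, "fc_out.weight": 0.4}
frozen, trainable = split_for_transfer(pretrained, ["embed", "conv1"])
print(sorted(trainable))
```

Freezing the early layers preserves the general structure-to-feature mapping learned from the large pristine dataset, so the few defect examples only have to re-fit the later layers.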
Deputy NoC: A Case of Low Cost Network-on-Chip for Neural Network Accelerations on GPUs
Status: Submitted, waiting
Timeline: August 2023 to May 2024
Advisor: Dr. Hui Zhao
Improves the power consumption and performance of deep neural networks by changing the GPU's network-on-chip to exploit data locality in the exponent field and only sending redundant exponents once, while using "deputy values" to further compress the mantissa field.
Specific contributions
- Adapted the Python and C++ benchmark projects from Elastic-Float (below) with new functionality, especially regarding the layout of the input and weight tensors for each convolution function.
- Made small changes to the GPU simulator and the configuration files with the help of the project lead.
- Organized tests and formatted data to be used as figures.
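The exponent-locality idea can be sketched in a few lines: extract the 8-bit exponent field of each float32 in a packet and, when every value shares one exponent, send it once. This is a toy illustration of the principle, not the paper's actual NoC packing:

```python
import struct

def exponent_bits(x):
    """Return the 8-bit biased exponent field of a float32 value."""
    return (struct.unpack("<I", struct.pack("<f", x))[0] >> 23) & 0xFF

def compress_exponents(values):
    """If every float in a flit shares one exponent, transmit it a single
    time; otherwise fall back to sending the raw exponent list."""
    exps = [exponent_bits(v) for v in values]
    if len(set(exps)) == 1:
        return ("shared", exps[0])
    return ("raw", exps)

# Values in [1.0, 2.0) all share biased exponent 127.
print(compress_exponents([1.5, 1.25, 1.75]))
```

Neural-network weights and activations tend to cluster in a narrow dynamic range, which is why the shared-exponent case is common enough to save link bandwidth.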
Elastic-Float: Lossy Cache Compression for Cost Effective Neural Network Acceleration
Status: Rejected, revising
Timeline: January 2023 to December 2023
Advisor: Dr. Hui Zhao
Developed a cache compression method for deep neural networks that significantly lowers cache power and capacity without harming performance or accuracy by exploiting value locality in floating-point data. I maintained the Python and C++ projects that were used as benchmarks to measure the performance of the methods, along with collecting the data for some figures.
Specific contributions
- Adapted CUTLASS code written by a previous student to improve its versatility for our project.
- Deconstructed famous image-recognition networks like AlexNet and ResNet into base functions that we can run independently on our GPU simulator.
- Created Python code that executes each of these base functions, turning one or more images into a list of image classifications.
- Ran tests both locally and on the Texas Advanced Computing Center (TACC) Lonestar6 supercomputer.
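One simple form of lossy floating-point compression of the kind neural-network data tolerates is zeroing the low mantissa bits of each float32. A minimal sketch of that idea, not Elastic-Float's actual encoding:

```python
import struct

def truncate_mantissa(x, keep_bits):
    """Lossily compress a float32 by zeroing all but the top `keep_bits`
    bits of its 23-bit mantissa.  The relative error stays below
    2**-keep_bits, which deep-learning workloads typically absorb."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]

print(truncate_mantissa(3.14159, 8))  # close to pi, small lossy error
```

Storing only the kept bits shrinks each value's cache footprint, and because truncation preserves the sign and exponent fields, nearby values stay nearby, which keeps the value locality the compression scheme relies on.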
Genomics-GPU: A Benchmark Suite for GPU-accelerated Genome Analysis
Status: Published, ISPASS 2023
Timeline: August 2022 to November 2022
Advisor: Dr. Hui Zhao
Created a benchmark suite for studying the effects of microarchitectural changes to the GPU for genome analysis. I adapted the gene-clustering benchmark to use CUDA dynamic parallelism and obtained a 24.2% improvement in performance. I also ran several tests on the simulator to collect data for figures. This was my first experience assisting with a research project!
Specific contributions
- Refactored the general algorithm from the gene-clustering algorithm to use CUDA dynamic parallelism, allowing the program to complete in 24.2% of the time on average.
- Aided in the compilation and execution of benchmarks like GASAL2 on the GPU simulator by modifying the source code.
- Collected the data for simple tests under close supervision of the PhD students involved.
- Learned how GPGPU-Sim (and processor simulators in general) work.
GPU Implementation of Image Recognition Neural Network Architectures
Status: Presented, UNT 2022 REU (poster, report, GitHub)
Timeline: May 2022 to July 2022
Advisor: Dr. Hui Zhao
Worked with another undergraduate student to compare CPU and GPU performance for image classification with the AlexNet, ResNet, and VGG-16 architectures. This involved re-creating each operator in each model in CUDA (both with and without tensor cores) and Python.
Specific contributions
- Collaborated with another student for the first time in a research setting.
- Implemented deep learning operators like 2D convolution, matrix multiplication, max pool, ReLU, and Softmax in Python (with Keras), C, and CUDA (with and without tensor cores).
- Recreated AlexNet, ResNet, and VGG-16 in each language by closely following their papers.
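Two of the operators listed above are small enough to sketch in plain Python (the real implementations also cover the Keras, C, and CUDA variants). A minimal version, assuming a 2D list input and non-overlapping pooling windows:

```python
def relu(v):
    """ReLU over a flat list: clamp negatives to zero."""
    return [max(0.0, x) for x in v]

def maxpool2d(image, p):
    """Non-overlapping p x p max pooling over a 2D list (stride = p):
    each output element is the maximum of one p x p window."""
    h, w = len(image), len(image[0])
    return [[max(image[i + di][j + dj]
                 for di in range(p) for dj in range(p))
             for j in range(0, w - p + 1, p)]
            for i in range(0, h - p + 1, p)]

print(relu([-1.0, 2.0]))               # negatives clamped to 0.0
print(maxpool2d([[1, 3], [2, 4]], 2))  # one 2x2 window -> its max
```

Writing the operators this explicitly first made it much easier to check the CUDA ports, since every thread block computes the same windows the Python loops do.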