Sitemap
A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.
Pages
Posts
[gem5 Q&A] Why is there misprediction of non-control instructions?
Published:
Hello, in the function checkSignalsAndUpdate(ThreadID tid) in the src/cpu/o3/fetch.cc file, it seems a misprediction can still be signaled from commit and decode even if mispredictInst->isControl() is false.
[gem5 Q&A] Page Walker: Where the PTE hits in the memory hierarchy
Published:
Hi, I am working on the x86 page walker in gem5. I understand that the page walker accesses the page walker cache (PWC) first and, in case of a miss, it accesses the memory hierarchy (L1, then L2, then L3 caches, and lastly the memory). This happens through the packetpointer read, which reads the physical address of the entry at each level (PML4, PDP, etc.).
[gem5 tech mark] Why are stores in the SQ assumed to have valid addresses?
Published:
Hi, I’m doing some gem5 hacking for research and have been confused about when loads search the store queue (SQ) and when stores have valid addresses that can be compared against. gem5 includes an assert in the read() method of the LSQ unit that the addresses of all stores before the executing load are valid, but I don’t understand how this can be guaranteed in OoO execution.
[gem5 Q&A] Fixed I/O Address Range in x86
Published:
Hi all, I’m trying to model the SPEC HPC benchmark suite in gem5 with an x86 ISA using KVM. As a result, I am trying to link the “_addr” version of the m5ops against the binaries in order to model the region of interest. Unfortunately, I get the following error when trying to build the sample hello world example:
[gem5 Q&A] Microcode_ROM Instruction and fetchRomMicroop() Function
Published:
Hello, I am looking at the AtomicSimpleCPU code in src/cpu/simple for the x86 ISA. I am trying to understand the following code snippet. Whenever this condition is true for a given PC, it does NOT follow the regular fetch from the instruction cache and then decode. This results in a macroop called Microcode_ROM, which is not an x86 macroop that has a sequence of uops (can be seen in the O3 CPU). Example: Instruction is: Microcode_ROM : ldst t0, HS:[t0 + t6 + 0x20] (This is taken from the O3 logs running the same workload by checking the same PC in the Debug logs).
[gem5 Q&A] Squashing Instructions after Page Table Fault
Published:
Hello, I am currently trying to locate the code that is used to squash instructions if a Page Table Fault is triggered in the O3 CPU. After using the PageTableWalker Debug Flags, my current guess would be gem5/src/arch/x86/pagetable_walker.cc in line 199. Furthermore I inspected the files in the src/cpu/o3 directory, but couldn’t find anything specific to squashing instructions after a fault.
Is my assumption correct that the O3 CPU implementation does not handle these things on its own, but the architectural part of the implementation does? If I am missing something, feel free to point it out.
portfolio
Portfolio item number 1
Short description of portfolio item number 1
Portfolio item number 2
Short description of portfolio item number 2
publications
Fuzzy Flow Regulation for Network-on-Chip based Chip Multiprocessors Systems
Published in 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC), 2014
Delays;Regulators;Fuzzy logic;Network-on-chip;Throughput;Pragmatics;Multiprocessor interconnection;Network-on-Chip;Chip Multiprocessor;Flow Regulation;Fuzzy Logic
Recommended citation: Y. Yao and Z. Lu, "Fuzzy Flow Regulation for Network-on-Chip based Chip Multiprocessors systems," 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC), Singapore, 2014, pp. 343-348, doi: 10.1109/ASPDAC.2014.6742913.
Download Paper
Towards Stochastic Delay Bound Analysis for Network-on-Chip
Published in 2014 Eighth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), 2014
Stochastic processes;Delays;Interference;Calculus;Analytical models;Servers;System-on-chip
Recommended citation: Z. Lu, Y. Yao and Y. Jiang, "Towards stochastic delay bound analysis for Network-on-Chip," 2014 Eighth IEEE/ACM International Symposium on Networks-on-Chip (NoCS), Ferrara, Italy, 2014, pp. 64-71, doi: 10.1109/NOCS.2014.7008763.
Download Paper
DVFS for NoCs in CMPs: A Thread Voting Approach
Published in 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2016
DVFS, Multi-core
Top conference publication-HPCA
Recommended citation: Y. Yao and Z. Lu, "DVFS for NoCs in CMPs: A thread voting approach," 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA), Barcelona, Spain, 2016, pp. 309-320, doi: 10.1109/HPCA.2016.7446074.
Download Paper
Memory-Access Aware DVFS for Network-on-Chip in CMPs
Published in 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), 2016
Switches;Resource management;Delays;Load modeling;Nickel;Tuning;Benchmark testing
Recommended citation: Y. Yao and Z. Lu, "Memory-access aware DVFS for network-on-chip in CMPs," 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, 2016, pp. 1433-1436.
Download Paper
Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs
Published in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016
Critical Section; CMP; NoC; OS
Top conference publication-ISCA
Recommended citation: Y. Yao and Z. Lu, "Opportunistic Competition Overhead Reduction for Expediting Critical Section in NoC Based CMPs," 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea (South), 2016, pp. 279-290, doi: 10.1109/ISCA.2016.33.
Download Paper
Aggregate Flow-Based Performance Fairness in CMPs
Published in ACM Transactions on Architecture and Code Optimization (TACO), Volume 13, Issue 4 Article No.: 53, Pages 1 - 27, 2016
computer architecture, performance fairness, quality of service
Recommended citation: Zhonghai Lu and Yuan Yao. 2016. Aggregate Flow-Based Performance Fairness in CMPs. ACM Trans. Archit. Code Optim. 13, 4, Article 53 (December 2016), 27 pages. https://doi.org/10.1145/3014429
Download Paper
Dynamic Traffic Regulation in NoC-Based Systems
Published in IEEE Transactions on Very Large Scale Integration (VLSI) Systems (Volume: 25, Issue: 2, February 2017), 2017
Delays;System performance;IP networks;Nickel;Regulators;Calculus;Network-on-chip;Chip multi/many-core processor (CMP);fuzzy control;multi/many-processor systems-on-chip (MPSoC);network-on-chip (NoC);traffic engineering
Recommended citation: Z. Lu and Y. Yao, "Dynamic Traffic Regulation in NoC-Based Systems," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 2, pp. 556-569, Feb. 2017, doi: 10.1109/
Download Paper
Marginal Performance: Formalizing and Quantifying Power Over/Under Provisioning in NoC DVFS
Published in IEEE Transactions on Computers (Volume: 66, Issue: 11, 01 November 2017), 2017
Energy efficiency;Power demand;Measurement;Benchmark testing;Energy efficiency;Network-on-chip;Program processors;Performance evaluation;power efficiency;DVFS;network-on-chip (NoC);CMP
Top transaction publication-TC
Recommended citation: Z. Lu and Y. Yao, "Marginal Performance: Formalizing and Quantifying Power Over/Under Provisioning in NoC DVFS," in IEEE Transactions on Computers, vol. 66, no. 11, pp. 1903-1917, 1 Nov. 2017, doi: 10.1109/TC.2017.2715018.
Download Paper
iNPG: Accelerating Critical Section Access with In-network Packet Generation for NoC Based Many-Cores
Published in 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018
Instruction sets;Spinning;Liquid crystal on silicon;Coherence;Acceleration;Routing protocols;In Network Packet Generation;Critical Section;Synchronisation Primitive;Cache Coherency;Network on Chip;CMP
Top conference publication-HPCA
Best-paper candidate
Recommended citation: Y. Yao and Z. Lu, "iNPG: Accelerating Critical Section Access with In-network Packet Generation for NoC Based Many-Cores," 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), Vienna, 2018, pp. 15-26, doi: 10.1109/HPCA.2018.00012.
Download Paper
Thread Voting DVFS for Manycore NoCs
Published in IEEE Transactions on Computers (Volume: 67, Issue: 10, 01 October 2018), 2018
Measurement;Message systems;System-on-chip;Instruction sets;Voltage control;Load modeling;Power system management;Chip manycore processor (CMP);DVFS;network on chip (NoC);power/energy efficiency
Top transaction publication-TC
Recommended citation: Z. Lu and Y. Yao, "Thread Voting DVFS for Manycore NoCs," in IEEE Transactions on Computers, vol. 67, no. 10, pp. 1506-1524, 1 Oct. 2018, doi: 10.1109/TC.2018.2827039.
Download Paper
Pursuing Extreme Power Efficiency With PPCC Guided NoC DVFS
Published in IEEE Transactions on Computers (Volume: 69, Issue: 3, 01 March 2020), 2020
Power demand;Message systems;Tuning;Thermal management;Monitoring;Energy consumption;Power system management;Manycore processor;DVFS;NoC;power efficiency;CMP
Top transaction publication-TC
Recommended citation: Y. Yao and Z. Lu, "Pursuing Extreme Power Efficiency With PPCC Guided NoC DVFS," in IEEE Transactions on Computers, vol. 69, no. 3, pp. 410-426, 1 March 2020, doi: 10.1109/TC.2019.2949807.
Download Paper
TSOPER: Efficient Coherence-Based Strict Persistency
Published in 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021
Protocols;Program processors;Nonvolatile memory;Computational modeling;Semantics;Coherence;Computer architecture;non-volatile memory;persistent memory;persistency;total store order;coherence
Top conference publication-HPCA
I am co-first author
Recommended citation: P. Ekemark, Y. Yao, A. Ros, K. Sagonas and S. Kaxiras, "TSOPER: Efficient Coherence-Based Strict Persistency," 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Seoul, Korea (South), 2021, pp. 125-138, doi: 10.1109/HPCA51647.2021.00021.
Download Paper
Game-of-Life Temperature-Aware DVFS Strategy for Tile-Based Chip Many-Core Processors
Published in IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS, Volume: 13, Issue: 1, March 2023), 2023
Dynamic voltage scaling, multiprocessor interconnection, automata.
Recommended citation: Y. Yao, "Game-of-Life Temperature-Aware DVFS Strategy for Tile-Based Chip Many-Core Processors," in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 13, no. 1, pp. 58-72, March 2023, doi: 10.1109/JETCAS.2023.3244763
Download Paper
SE-CNN: Convolution Neural Network Acceleration via Symbolic Value Prediction
Published in IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS, Volume: 13, Issue: 1, March 2023), 2023
Artificial intelligence, artificial neural networks, AI accelerators.
Recommended citation: Y. Yao, "SE-CNN: Convolution Neural Network Acceleration via Symbolic Value Prediction," in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 13, no. 1, pp. 73-85, March 2023, doi: 10.1109/JETCAS.2023.3244767.
Download Paper
Silent Stores in the Battery-less Internet of Things: A Good Idea?
Published in Proceedings of the 2023 International Conference on Embedded Wireless Systems and Networks (EWSN), 2023
sensor network;store buffer;silent store;low power devices
Recommended citation: Weining Song, Stefanos Kaxiras, Luca Mottola, Thiemo Voigt, and Yuan Yao, "Silent Stores in the Battery-less Internet of Things: A Good Idea?" in Proceedings of the 2023 International Conference on Embedded Wireless Systems and Networks (EWSN), Association for Computing Machinery, New York, NY, USA, 40-45.
Download Paper
TaDA: Task Decoupling Architecture for the Battery-less Internet of Things
Published in Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (SenSys), 2024
Task decoupling, Internet of Things (IoT), energy harvesting, intermittent computing
SenSys is a top conference in wireless sensors
Recommended citation: Weining Song, Stefanos Kaxiras, Thiemo Voigt, Yuan Yao, and Luca Mottola. 2024. TaDA: Task Decoupling Architecture for the Battery-less Internet of Things. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (SenSys). Association for Computing Machinery, New York, NY, USA, 409–421. https://doi.org/10.1145/3666025.3699347
Download Paper
TangramFP: Energy-Efficient, Bit-Parallel, Multiply-Accumulate for Deep Neural Networks
Published in 2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), 2024
Bit-Parallel, Energy-Efficient Multiply Accumulate, Deep Neural Networks
Best paper candidate
Recommended citation: Y. Yao, X. Chen, H. Atmer and S. Kaxiras, "TangramFP: Energy-Efficient, Bit-Parallel, Multiply-Accumulate for Deep Neural Networks," 2024 IEEE 36th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Hilo, HI, USA, 2024, pp. 1-12, doi: 10.1109/SBAC-PAD63648.2024.00009.
Download Paper
RXT: RefleXive address Translation for Pointer-Chasing Workloads
Published in 39th IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2025
Early acceptance paper [to appear]
Recommended citation: R. Aligholipour, P. Aimoniotis, S. Kaxiras and Y. Yao, "RXT: RefleXive address Translation for Pointer-Chasing Workloads," 39th IEEE International Parallel & Distributed Processing Symposium (IPDPS), Milan, Italy, 2025.
Download Paper
talks
Talk 1 on Relevant Topic in Your Field
Published:
This is a description of your talk, which is a markdown file that can be all markdown-ified like any other post. Yay markdown!
teaching
Computer Architecture I
Undergraduate course, Uppsala University, Department of IT, 2024
TL;DR An introduction to computer architecture using the MIPS ISA, covering the ideas behind the micro-architecture that implements the ISA.
I have been the course responsible since 2021.
Course’s webpage at Uppsala University
Accelerating Systems with Programmable Logic Components
Graduate course, Uppsala University, Department of IT, 2024
TL;DR Join us to gain hands-on experience in FPGA acceleration, neural network optimization, and hardware-software co-design, while mastering the Xilinx Zynq-7000 FPGA system! 🚀
I have been the course responsible since 2020.
Course’s webpage at Uppsala University
tools
Statically-linked PARSEC-3.0 benchmarks
Published:
In this project I modified PARSEC-3.0 benchmarks using static linking with gem5 hooks for the x86_64 architecture.
Statically-linked WHISPER benchmarks
Published:
WHISPER benchmark suite with static linkage.
My config files for some GNU tools such as Emacs, Vim, etc.
Published:
Some of my personal configurations for my favorite GNU tools.
TangramFP
Published:
TangramFP: Energy-Efficient, Bit-Parallel Multiply-Accumulate for Deep Neural Networks