Abstract
Racetrack memories (RMs) have significantly evolved since their conception in 2008, making them a serious contender in the field of emerging memory technologies. Despite key technological advancements, the access latency and energy consumption of an RM-based system are still highly influenced by the number of shift operations. These operations are required to move bits to the right positions in the racetracks. This article presents data-placement techniques for RMs that maximize the likelihood that consecutive references access nearby memory locations at runtime, thereby minimizing the number of shifts. We present an integer linear programming (ILP) formulation for optimal data placement in RMs, and we revisit existing offset assignment heuristics, originally proposed for random-access memories. We introduce a novel heuristic tailored to a realistic RM and combine it with a genetic search to further improve the solution. We show a reduction in the number of shifts of up to 52.5%, outperforming the state of the art by up to 16.1%.
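To make the objective concrete, the following minimal sketch counts shifts for a short access trace under two placements. It rests on assumptions that are not taken from the article: a single track with a single access port, a cost of |p - q| shifts to move from position p to position q, and the hypothetical helper names `shift_cost` and `best_placement`. The exhaustive search merely stands in for an optimal method on tiny inputs; it is not the ILP formulation or the heuristic described above.

```python
from itertools import permutations


def shift_cost(trace, placement):
    """Count shifts for a trace, assuming one track with one access port.

    `placement` maps each variable to its position on the track. Moving the
    port from position p to position q is assumed to cost |p - q| shifts.
    """
    positions = [placement[v] for v in trace]
    return sum(abs(a - b) for a, b in zip(positions, positions[1:]))


def best_placement(trace):
    """Try every ordering of the variables (feasible only for tiny inputs)."""
    variables = sorted(set(trace))
    best_order = min(
        permutations(variables),
        key=lambda order: shift_cost(trace, {v: i for i, v in enumerate(order)}),
    )
    return {v: i for i, v in enumerate(best_order)}


# Illustrative trace: "a" and "c" alternate first, then "b" and "d".
trace = ["a", "c", "a", "c", "b", "d", "b", "d"]
naive = {"a": 0, "b": 1, "c": 2, "d": 3}  # placement in declaration order
tuned = best_placement(trace)

print("naive shifts:", shift_cost(trace, naive))
print("tuned shifts:", shift_cost(trace, tuned))
```

Placing variables that are accessed back to back in adjacent positions lowers the count in this toy model, which is the effect the placement techniques aim for at scale.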
Index Terms
- ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0