Old PBM763 (2020): Readings for Presentations

Topics Assignments Coq Proofs Presentations Sources

Individual reading assignments to prepare presentations.

It would be highly inefficient to search for prime numbers with a full search algorithm such as the Sieve of Eratosthenes or testing divisibility up to the square root. Nevertheless, large primes are important for cryptography and other applications. There are several efficient algorithms for primality testing.
  • The Sieve of Eratosthenes: An ancient algorithm to build a list of primes in a large interval.
  • Miller-Rabin primality test: Most practical and widely used algorithm is probabilistic. It may falsely claim that a composite number is prime with a small probability. The error can be made as small as you want, if you run sufficiently many tests.
  • Great Internet Mersenne Prime Search (GIMPS) project: The largest known primes have special form – they are named Mersenne Primes. They have even more efficient algorithms to check their primality (much faster than Miller-Rabin or Agrawal-Kayal-Saxena). Since 1996 there is an international activity to search for new Mersenne Primes - using computing power all around the world. This distributed computing effort served as a prototype for Bitcoin mining two decades later.
  • Agrawal-Kayal-Saxena primality test: For several decades there was no known deterministic efficient (polynomial time) algorithm to find primes. Miller-Rabin test did not count (since it is probabilistic). In 2002 the first algorithm to find primes efficiently was found.
Hash algorithms (one-way functions that map large domain of inputs to a smaller range of values) may include checksums, CRC and other simple computations that could ensure, for example, that there have been no errrors when transmitting files. On the other hand, Secure Hash Algorithms should ensure that nobody can infer any meaningful information from the hashed value, and most importantly – it should be prohibitively expensive to build “collisions” (find any two arguments that map to the same value). They are used to store user passwords and other information (that we need to check against, but do not want to store in cleartext). Historically MD5 algorithm has been widely used, but it turned out to be insecure (collisions are relatively easy to find). Therefore there have been SHA1, SHA2 (including various lengths of the hash values – SHA-256 is a member of SHA2 family), SHA3 and some others.
Unlike Lempel-Ziv or Burrows-Wheeler codes (optimal for human language where letter combinations and entire words often repeat), arithmetic coding relies on the probabilities of the messages. It aims to achieve the maximum compression rate possible due to Shannon’s enthropy theorem about the information content. Such codes are called “enthropy codes”. They include Huffman and Arithmetic coding. Arithmetic coding represents each sequence of messages as an interval of real numbers from the interval [0;1].
  • Arithmetic coding (Wikipedia): Different letters (also called “messages”) appear with different probabilities. Arithmetic coding uses the fact that it is more efficient to encode frequently occurring things with sending fewer bits. At every time moment the current message sequence is represented as a numeric interval. Whenever a new message arrives, the interval shrinks (if the new message is rare, it shrinks more). At every moment we can restore the sequence of the messages that generated our current interval.
  • Introduction to Data Compression by G.Blelloch (pp.16-22): How arithmetic coding fits into the larger picture of information compression; Shannon’s theorem on information content is proved. Also a discussion how to deal with the real number arithmetic on actual computers that enjoy integer numbers best.
  • Lecture 5 on Enthropy (David MacKay, Cambridge); start video at 34:30.: How probabilistic models that generate messages behave with respect to arithmetic coding.
  • Arithmetic Coding - Theory and Practice by Amir Said (Hewlett-Packard Labs): Theoretical introduction followed by lots and lots of implementation details. This shows how a simple idea (having neat proportions of real-number intervals) does not stay simple in real life.
  • Asymmetric numeral systems: ANS were developed about 2014 by Jarosław (Jarek) Duda at University of Kraków. It can speed up compression that is very similar to arithmetic coding by a factor of 10 or more. Such algorithms have become the basis for compression standards used by ZStandard, Draco 3D and multiple contemporary operating systems.
Searching for a substring in a larger string is needed for many uses – text editors, database selects, preventing leaks of sensitive data, genome biology. String search can be implemented in the most obvious way (as two nested loops), but it is usually too slow and inefficient. For this reason there are several improvements. In particular, they involve Knuth-Morris-Pratt (KMP) algorithm that compares strings forwards, but has a fancy “prefix function” showing by how much we can jump. And Boyer Moore (BM) algorithm that compares strings going backwards. In practice there are also many variations of these.
Plagiarism detection is a broader topic than just academic dishonesty. Perhaps, the broadest area of applications is document fingerprinting that tries to protect intellectual property (IP) by detecting every occurrence, when some document is sent out, if it contains a large chunk of some secret file. Plagiarism detection uses local similarity (finding large common substrings between the given text and the corpus of existing texts), lexical similarity (checking word frequencies or other statistics that distinguish style of different authors), and global similarity – all the way up to Machine Learning and text classification using Support Vector Machines and Neural Networks. (It might be too much for a single presentation, but plagiarism is also relevant for the EU Copyright Directive that mandates the use of Upload Filters – it may prevent people from uploading their OWN music and other works just because they have been detected by the plagiarism filter.)
  • Wikipedia: Plagiarism Detection: A reasonable overview of the methods used.
  • N-Gram Algorithms: N-Grams or sliding windows reads all the protected documents; extracts text; normalizes text (convert everything to lower case, replace punctuation by single spaces); takes a window of, say, 5 consecutive words and computes hash function. If the target document has many 5-word sequences that coincide with existing documents, it is strong evidence that there are common quotes (and these sequences did not occur just by random).
  • Using Rabin-Karp Algorithm: A very theoretical article on Rabin-Karp (which is originally a string searching algorithm) and its applications for plagiarism.
  • Rabin-Karp implementation: Rolling hash (similar to N-Gram idea) is used to compute Rabin-Karp function to do look-ups efficiently.
  • Finding similar sentences with Bert Service: Bert Service is an Artificial Intelligence solution for text classification tasks using Google’s Tensor Flow library. There are some limitations (it needs large training sets to be efficient, and it may also be slow), but there are some suggestions how it can be used to find similar text chunks efficiently.
  • Cross-Language Text Alignment for Plagiarism Detection: Plagiarism detection becomes much harder, if translation between languages is allowed. It usually means that the plagiarist takes an existing text, uses a machine translation service, edits the end-result a little bit – and then uses the plagiarized text in another language.
It turns out that many mathematicians use similar approaches to make their thinking process more creative and productive. It involves techniques of careful reading, imagining the provable goal, working backwards, visualization and more specific techniques. These include all types of “exhaustive search” (or “getting your hands dirty”), balancing your inductive hypotheses with its goal, “thinking globally, acting locally”, “making your life easier” (by replacing your problem with an easier one) and many more. Methods of mathematical creativity are usually discussed in books by authors as divers as Greek philosopher Plato (his dialogue Meno), George Polya, and Paul Zeitz. Many others have also contributed – since creativity is a complex and slightly subjective thing.
Bloom Filters explain how to store huge hashtables in a very limited space. For example, we want to have a huge lookup table (for a plagiarism detection device), but we do not want to transfer gigabytes of information whenever our reference data set changes. Bloom Filters provide an elegant way to compress hashtables by placing them into a “probabilistic” data structure. It is probabilistic, because it can occasionally have “false positives” - it can claim that some element is in the set (even though it is not there). Even Bloom Filters never have “false negatives” - if the data structure says that the string was not found, it is always true.
This presentation has to define what a regular expression is and the problem of searching them. (Regex finding is a feature for all major text editors; for example, Notepad++ or Gedit have this feature.) Regex patterns, their matching with finite state automata and performance of these matchers, including catastrophic backtracking.
Formulate four-color theorem (and similar theorems on Torus and other non-spheric topologies). Show how it can break down to analysis of a large number of cases.
Build and analyze a game-tree for a pen-and-paper that is feasible to analyze on a PC computer. Apply alpha-beta pruning algorithm to have a computer that wins human players.
Matrix Games can model decision taking in economy and other areas. It describes all the options that players can take; for each pair of choices it computes the outcome of the game, and places them in a big table called “matrix”. Distinguish zero-sum and non-zero-sum games. Show how to find Nash equilibrium; apply it to some simple number guessing games (Rock paper scissors; some variant of a Prisoner's dilemma; simultaneous guessing of positive integers.)
  • High-Low Guessing Game: Variations of a number guessing game (which would otherwise be a boring binary search problem).
  • Wikipedia: Prisoner’s Dilemma: How some matrix games (with non-zero-sum outcomes) produce paradoxes, where the global optimum for all players does not match the locally optimal strategy for each isolated player (if the strategies of other players are somehow fixed).
Define Hamiltonian cycles. Formulate theorem by Emanuels Grinbergs – the necessary condition for Hamiltonian cycles in planar graphs. Define concepts for NP complete problems and decision problems in general. Describe, how all NP-problems can be reduced to Hamiltonian cycles. The concepts of NP complete problems and decision problems. How the NP-complete problems can be reduced to/from Hamiltonian cycles.
A discussion how a (directed) graph of Webpage links can indicate, which pages are the most relevant (most often linked by other important pages). This topic can include other computable rankings of search results – including “recommendation problem”, where we have many users, some users have preferences that are well-known; the problem is to find similar users who might prefer the same things. This is used by Google, Netflix and many other service providers when there are thousands of options to choose from, but we want to recommend just the things that are most likely to match the given user’s preferences.
  • Google Page Rank: The page with the initial discussion how a page rank can be computed.
  • The Google Pagerank Algorithm and How It Works by Ian Rogers: Princeton University resource with some bibliography. It explains the theoretical concept, how it can be “cheated”, and what you should do (as a SEO – Search Engine Optimizer) to improve your page rank, getting better search results in Google without clear breaking of Google rules and being punished for dishonest self-promotion.
Russel's paradox, diagonalization results, Barber's paradox, Halting problem.
Results how voting among shareholders can lead either to a "dictator" deciding everything or to the voting process being easy to manipulate. This result is less relevant for national elections (because they do not happen often enough to allow manipulation scenarios described by Arrow), but this is important for shareholder decision making and other business applications where multiple users vote accordingly to their self-interest (but eventually the result of their voting turns out to be harmful to the self interest of the majority – no matter, how the voting procedure is organized). Arrow was awarded Nobel prize for this and other studies of collective decisions; why some naive intuitions about democracy may be wrong.

Unclaimed Stories